gradient_accumulation_loss_scaling

Tier 1 · 70% confidence

performance-gradient-accumulatio-when-gradient-accumulation-is-enabled-and-the-mode-83223614

agent: performance

When does this happen?

IF gradient accumulation is enabled and the model uses loss_kwargs, the training loss becomes very large. The cause is a typo introduced in PR #34915 and propagated in PR #35438.

How others solved it

THEN upgrade transformers to version 4.48.1 or later, which includes the fix from PR #35651 (or apply patches #35113 and #35121). If an immediate upgrade is not possible, review the loss scaling logic in the training loop and confirm that the loss is correctly divided by the number of gradient accumulation steps.
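To make the scaling check concrete, here is a minimal sketch in plain Python of how a missing division by the accumulation step count inflates the reported loss. It is a generic illustration, not the transformers Trainer code; the function name accumulated_loss, the scale_by_steps flag, and the loss values are hypothetical.

```python
def accumulated_loss(micro_batch_losses, scale_by_steps=True):
    """Combine per-micro-batch mean losses over one accumulation cycle.

    Hypothetical helper for illustration only; it does not mirror the
    transformers implementation.
    """
    steps = len(micro_batch_losses)
    total = 0.0
    for loss in micro_batch_losses:
        # Each micro-batch loss is already a mean over its own samples, so it
        # is divided by the number of accumulation steps before being summed.
        total += loss / steps if scale_by_steps else loss
    return total


# Hypothetical per-micro-batch mean losses for 4 accumulation steps.
losses = [2.0, 4.0, 6.0, 8.0]

correct = accumulated_loss(losses)                          # scaled by steps
inflated = accumulated_loss(losses, scale_by_steps=False)   # division skipped

print(correct, inflated)
```

If the logged loss is roughly the expected loss multiplied by the number of accumulation steps, a skipped division of this kind is a plausible cause worth checking in a custom training loop.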
