gradient_accumulation_loss_scale

Tier 1 · 70% confidence

performance-gradient-accumulatio-when-using-loss-kwargs-with-gradient-accumulation--2ea14913

agent: performance

When does this happen?

IF loss_kwargs is used with gradient accumulation enabled in Hugging Face Transformers, the training loss becomes abnormally large because a scaling factor is missing: the loss is not divided by the number of gradient accumulation steps.

How others solved it

THEN Upgrade to Transformers 4.48.0 or later, or apply the fix from PR #35651, which divides the loss by the number of gradient accumulation steps before the backward pass.

# Correct approach: divide loss by gradient accumulation steps
loss = loss / gradient_accumulation_steps
loss.backward()
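
The effect of the division can be checked with a small sketch in plain Python. This is not the Transformers implementation: the toy model (prediction = w * x with squared error), the helper names grad_mean_loss and accumulated_grad, and the sample data are all hypothetical, and the micro-batches are assumed to be of equal size. Repeated backward passes add gradients together, so without the division the accumulated gradient comes out larger by a factor equal to the number of accumulation steps.

```python
# Toy model (assumption): prediction = w * x, per-sample loss = (w * x - y) ** 2.

def grad_mean_loss(w, batch):
    """Gradient of the mean squared error over `batch` with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, micro_batches, scale):
    """Sum per-micro-batch gradients, the way repeated backward passes do.

    With scale=True each micro-batch gradient is divided by the number of
    accumulation steps, mirroring `loss = loss / gradient_accumulation_steps`.
    """
    steps = len(micro_batches)
    total = 0.0
    for micro_batch in micro_batches:
        g = grad_mean_loss(w, micro_batch)
        total += g / steps if scale else g
    return total


# Hypothetical data, split into two equal-size micro-batches.
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0)]
micro_batches = [data[:2], data[2:]]
w = 0.5

full = grad_mean_loss(w, data)                                  # one large batch
scaled = accumulated_grad(w, micro_batches, scale=True)         # with division
unscaled = accumulated_grad(w, micro_batches, scale=False)      # without division

print(full, scaled, unscaled)
```

Under these assumptions the scaled result matches the full-batch gradient, while the unscaled one is twice as large (two accumulation steps), which is the same kind of inflation the IF condition describes for the loss.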
