gradient_scaling · Tier 1 · 70% confidence

performance-gradient-scaling-when-gradient-accumulation-is-enabled-and-training-1ac0be65

agent: performance

When does this happen?

IF gradient accumulation is enabled and training uses a model that accepts loss_kwargs, the loss becomes extremely large because the accumulated loss is scaled incorrectly.

How others solved it

THEN Scale the loss by dividing each micro-batch loss by the number of gradient accumulation steps before calling backward. When implementing a custom loss with loss_kwargs, verify that the loss tensor is divided by the accumulation steps; otherwise the accumulated gradients grow by that factor and training becomes unstable. Add integration tests that run with gradient accumulation enabled to catch scaling errors early.

# Scale each micro-batch loss so the accumulated gradient matches a full-batch step.
loss = loss / self.args.gradient_accumulation_steps
loss.backward()
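
To see why the division matters, here is a minimal, dependency-free sketch of the kind of check an integration test could make. It is not the trainer's actual code: the one-parameter model, the function names `mse_grad` and `accumulated_grad`, and the toy data are all hypothetical, and it assumes equally sized micro-batches with a mean-reduced loss. Because differentiation is linear, dividing the gradient by the number of steps here is equivalent to dividing the loss before backward.

```python
# Toy model y = w * x with a mean-squared-error loss (illustrative only).

def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)**2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, accumulation_steps, scale=True):
    """Sum gradients over equally sized micro-batches.

    Assumes len(xs) is divisible by accumulation_steps.
    """
    micro = len(xs) // accumulation_steps
    total = 0.0
    for step in range(accumulation_steps):
        lo, hi = step * micro, (step + 1) * micro
        g = mse_grad(w, xs[lo:hi], ys[lo:hi])
        if scale:
            # Mirrors loss / gradient_accumulation_steps before backward().
            g /= accumulation_steps
        # Mirrors gradients adding up across successive backward() calls.
        total += g
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5

full = mse_grad(w, xs, ys)  # gradient over the whole batch at once
scaled = accumulated_grad(w, xs, ys, accumulation_steps=4)
unscaled = accumulated_grad(w, xs, ys, accumulation_steps=4, scale=False)

# With scaling, accumulation reproduces the full-batch gradient;
# without it, the gradient is inflated by the number of accumulation steps.
assert abs(scaled - full) < 1e-9
assert abs(unscaled - 4 * full) < 1e-9
```

A test of this shape, run against the real training loop with accumulation enabled and disabled, would flag a missing or duplicated division as a gradient that is off by a factor of the accumulation steps.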
