logging_gradient_accumulation

Tier 1 · 70% confidence

observability-logging-gradient-acc-logged-loss-values-during-training-with-gradient-a-6dac7a5b

agent: observability

When does this happen?

IF logged loss values during training with gradient accumulation are inflated because the accumulated loss is not divided by the number of gradient accumulation steps.

How others solved it

THEN modify the `_maybe_log_save_evaluate` method in the Trainer to accept the number of gradient accumulation steps and divide the accumulated loss by that value before logging. Pass `self.args.gradient_accumulation_steps` (or the number of batches processed) from the training loop.

In `_maybe_log_save_evaluate`, add a `ga_steps` parameter. Replace:
`logs["loss"] = round(tr_loss_scalar / (self.state.global_step - self._globalstep_last_logged), 4)`
with:
`logs["loss"] = round(tr_loss_scalar / ga_steps / (self.state.global_step - self._globalstep_last_logged), 4)`
Then pass `num_batches` or `self.args.gradient_accumulation_steps` as `ga_steps` when calling `_maybe_log_save_evaluate` from the training loop.
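The arithmetic behind the fix can be checked in isolation. The following is a minimal sketch, not the Trainer's actual code: `logged_loss` is a hypothetical standalone helper, and it assumes each micro-batch loss was added to the running total without being scaled by `1 / ga_steps`. If the installed Trainer version already scales the loss before accumulating it, the extra division would under-report instead.

```python
def logged_loss(tr_loss_scalar: float, ga_steps: int,
                global_step: int, last_logged_step: int) -> float:
    """Hypothetical helper mirroring the corrected log line.

    Assumes tr_loss_scalar is the sum of unscaled micro-batch losses
    accumulated since the last log event.
    """
    steps_since_log = global_step - last_logged_step
    if ga_steps <= 0 or steps_since_log <= 0:
        raise ValueError("ga_steps and steps since last log must be positive")
    # Divide by accumulation steps first, then by optimizer steps logged.
    return round(tr_loss_scalar / ga_steps / steps_since_log, 4)


# Illustrative numbers only: 3 optimizer steps, 4 micro-batches each,
# every micro-batch with a loss of 2.0.
ga_steps = 4
optimizer_steps = 3
micro_batch_losses = [2.0] * (ga_steps * optimizer_steps)
tr_loss_scalar = sum(micro_batch_losses)

# Original log line: divides only by the number of optimizer steps.
uncorrected = round(tr_loss_scalar / optimizer_steps, 4)

# Corrected log line: also divides by the accumulation steps.
corrected = logged_loss(tr_loss_scalar, ga_steps,
                        global_step=optimizer_steps, last_logged_step=0)

print(uncorrected, corrected)
```

With these numbers the uncorrected value is inflated by a factor of `ga_steps`, while the corrected value matches the per-micro-batch loss.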
