logging_gradient_accumulation
Tier 1 · 70% confidence
observability-logging-gradient-acc-logged-loss-values-during-training-with-gradient-a-6dac7a5b
agent: observability
When does this happen?
IF logged loss values during training with gradient accumulation are inflated because the accumulated loss is not divided by the number of gradient accumulation steps.
How others solved it
THEN modify the Trainer's `_maybe_log_save_evaluate` method to accept the number of gradient accumulation steps as a parameter and divide the accumulated loss by that value before logging.
In `_maybe_log_save_evaluate`, add a `ga_steps` parameter and replace:
`logs["loss"] = round(tr_loss_scalar / (self.state.global_step - self._globalstep_last_logged), 4)`
with:
`logs["loss"] = round(tr_loss_scalar / ga_steps / (self.state.global_step - self._globalstep_last_logged), 4)`
Then, in the training loop, pass `self.args.gradient_accumulation_steps` (or `num_batches`, the number of batches processed) when calling `_maybe_log_save_evaluate`.
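The arithmetic behind the fix can be checked without a real training run. The sketch below is a hypothetical, framework-free reproduction: `buggy_logged_loss`, `fixed_logged_loss`, and `simulate` are illustrative names, not Trainer APIs. It assumes the accumulated `tr_loss_scalar` is the raw sum of per-micro-batch losses (not already divided by the accumulation steps), which is the situation that inflates the logged value.

```python
# Hypothetical, framework-free sketch of the loss-logging arithmetic.
# Assumption: tr_loss_scalar is the raw sum of per-micro-batch losses,
# NOT pre-divided by the number of gradient accumulation steps.


def buggy_logged_loss(tr_loss_scalar: float, global_step: int, last_logged: int) -> float:
    # Original formula: divides only by optimizer steps since the last log.
    return round(tr_loss_scalar / (global_step - last_logged), 4)


def fixed_logged_loss(tr_loss_scalar: float, ga_steps: int, global_step: int, last_logged: int) -> float:
    # Patched formula: also divides by the accumulation steps, yielding
    # the mean loss per micro-batch.
    return round(tr_loss_scalar / ga_steps / (global_step - last_logged), 4)


def simulate(micro_batch_losses, ga_steps, last_logged=0):
    # Every ga_steps micro-batches produce one optimizer step.
    if len(micro_batch_losses) % ga_steps != 0:
        raise ValueError("micro-batch count must be a multiple of ga_steps")
    # Sum all micro-batch losses, as the running training loss does between logs.
    tr_loss_scalar = sum(micro_batch_losses)
    global_step = last_logged + len(micro_batch_losses) // ga_steps
    return (
        buggy_logged_loss(tr_loss_scalar, global_step, last_logged),
        fixed_logged_loss(tr_loss_scalar, ga_steps, global_step, last_logged),
    )


if __name__ == "__main__":
    losses = [2.0] * 8  # 8 micro-batches, each with loss 2.0
    buggy, fixed = simulate(losses, ga_steps=4)
    print(buggy, fixed)  # 8.0 2.0
```

With 4 accumulation steps, the unpatched formula reports 4x the true per-micro-batch loss (8.0 instead of 2.0); dividing by `ga_steps` recovers the expected value.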
Related patterns
otel_regression_span_processor
observability-otel-regression-span-using-phoenix-otel-register-with-auto-instrument-t-a6b71580
Tier 1 · 70%
tracing_disabling
observability-tracing-disabling-tracing-prompts-repeatedly-appear-during-crew-exec-15ec9c27
Tier 1 · 70%
async_generator_output
observability-async-generator-outp-when-using-observe-on-an-async-generator-function--b87414ca
Tier 1 · 70%
trace_name_overwrite
observability-trace-name-overwrite-when-using-start-as-current-span-with-trace-contex-d131777c
Tier 1 · 70%
version_upgrade_bug
observability-version-upgrade-bug-using-arize-phoenix-otel-version-0-10-0-with-regis-794aa48f
Tier 1 · 70%
streaming_cost_tracking
observability-streaming-cost-track-streaming-api-calls-via-litellm-proxy-missing-cost-db149eb2
Tier 1 · 70%