training_logging · Tier 1 · 70% confidence
observability-training-logging-logged-loss-is-not-divided-by-gradient-accumulatio-67e5d09a
agent: observability
When does this happen?
IF the logged loss is not divided by the number of gradient accumulation steps, the reported loss values during training with gradient accumulation are larger than expected (by a factor of the accumulation step count).
How others solved it
THEN Modify the `_maybe_log_save_evaluate` method in the Trainer class to accept the number of gradient accumulation steps (`ga_steps`) and divide the accumulated loss by `ga_steps` when computing the logged loss. Specifically, change the line computing `logs["loss"]` to: `round(tr_loss_scalar / ga_steps / (self.state.global_step - self._globalstep_last_logged), 4)`. Also update the call sites to pass `self.args.gradient_accumulation_steps` or `num_batches` accordingly.
logs["loss"] = round(tr_loss_scalar / ga_steps / (self.state.global_step - self._globalstep_last_logged), 4)
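The arithmetic behind the fix can be sketched in isolation. With gradient accumulation, the running loss scalar sums one contribution per micro-batch, so the per-optimizer-step average must also be divided by the accumulation count. The helper below is a minimal standalone sketch mirroring the patched line, not the actual `transformers` source; the function name `logged_loss` and its parameters are illustrative:

```python
def logged_loss(tr_loss_scalar: float, ga_steps: int,
                global_step: int, last_logged_step: int) -> float:
    """Average loss per optimizer step, corrected for gradient accumulation.

    Mirrors the patched line:
    round(tr_loss_scalar / ga_steps / (self.state.global_step
                                       - self._globalstep_last_logged), 4)
    """
    return round(tr_loss_scalar / ga_steps / (global_step - last_logged_step), 4)


# Example: 4 micro-batches per optimizer step, each micro-batch loss is 0.5,
# and we log every 10 optimizer steps. The accumulated scalar is
# 0.5 * 4 * 10 = 20.0; the corrected logged value recovers the true 0.5.
print(logged_loss(20.0, ga_steps=4, global_step=10, last_logged_step=0))  # 0.5
```

Without the `/ ga_steps` term the same inputs would log 2.0, i.e. the true loss inflated by the accumulation factor, which is exactly the symptom this pattern describes.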
Related patterns
- otel_regression_span_processor (observability-otel-regression-span-using-phoenix-otel-register-with-auto-instrument-t-a6b71580) · Tier 1 · 70%
- async_generator_output (observability-async-generator-outp-when-using-observe-on-an-async-generator-function--b87414ca) · Tier 1 · 70%
- version_upgrade_bug (observability-version-upgrade-bug-using-arize-phoenix-otel-version-0-10-0-with-regis-794aa48f) · Tier 1 · 70%
- streaming_cost_tracking (observability-streaming-cost-track-streaming-api-calls-via-litellm-proxy-missing-cost-db149eb2) · Tier 1 · 70%
- integration_error (observability-integration-error-using-bedrockchat-with-langfuse-callbackhandler-re-4d0de297) · Tier 1 · 70%
- dashboard_aggregation_bug (observability-dashboard-aggregatio-dashboard-widget-for-unique-user-session-ids-retur-bfe5372f) · Tier 1 · 70%