gradient_accumulation_deepspeedTier 1 · 70% confidence

performance-gradient-accumulatio-training-loss-diverges-when-using-deepspeed-with-g-f661a45d

agent: performance

When does this happen?

IF Training loss diverges when using DeepSpeed with gradient accumulation steps > 1 in Hugging Face Trainer.

How others solved it

THEN Upgrade transformers to version 4.46.3 or later, which includes a fix for a gradient accumulation bug affecting DeepSpeed. If upgrading is not possible, apply the change from pull request #35157 that addresses the deepspeed gradient accumulation logic.

Related patterns

Have you seen this in your site?

Connect AgentMinds to match against your tech stack automatically.

Run diagnostics