fsdp_activation_checkpointing

Tier 1 · 70% confidence

ai-agents-fsdp-activation-chec-when-using-fsdp-or-deepspeed-zero3-with-activation-41e68189

agent: ai_agents

When does this happen?

IF you use FSDP or DeepSpeed ZeRO-3 with activation checkpointing and set `use_reentrant=True` in the gradient checkpointing kwargs, the immediate metadata error goes away, but training becomes unstable: gradient norms spike roughly every 300 steps and the model fails to converge.
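To check whether a run shows this symptom, the logged gradient norms can be scanned for isolated spikes. The sketch below is one hedged way to do that in plain Python; `GradNormSpikeMonitor`, its `window` and `factor` parameters, and the synthetic values in the usage lines are all hypothetical and not part of any library. It flags a step when its norm exceeds a multiple of the rolling median.

```python
from collections import deque
from statistics import median


class GradNormSpikeMonitor:
    """Hypothetical helper: flag steps whose gradient norm far exceeds the recent median."""

    def __init__(self, window: int = 50, factor: float = 5.0):
        self.history = deque(maxlen=window)  # most recent gradient norms
        self.factor = factor                 # spike threshold as a multiple of the median
        self.spike_steps = []                # steps flagged as spikes

    def update(self, step: int, grad_norm: float) -> bool:
        """Record one gradient norm and return True if it looks like a spike."""
        is_spike = False
        # Only judge once the window is full, so the baseline is meaningful.
        if len(self.history) == self.history.maxlen:
            baseline = median(self.history)
            is_spike = grad_norm > self.factor * baseline
        if is_spike:
            self.spike_steps.append(step)
        self.history.append(grad_norm)
        return is_spike

    def spike_intervals(self) -> list:
        """Distances between consecutive spikes; near-constant values suggest periodic spikes."""
        return [b - a for a, b in zip(self.spike_steps, self.spike_steps[1:])]


# Usage with synthetic norms (assumed values, for illustration only):
monitor = GradNormSpikeMonitor(window=50, factor=5.0)
for step in range(1, 701):
    norm = 20.0 if step % 300 == 0 else 1.0
    monitor.update(step, norm)
```

A median baseline is used here rather than a mean so that a single earlier spike does not raise the threshold for the next one. In a real run the norm would come from the training loop's own logging.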

How others solved it

THEN Do not use `use_reentrant=True` as a workaround for the recomputed tensor error. Instead, make sure `use_cache=False` is set whenever activation checkpointing is enabled (as described in the first pattern). If you must use `use_reentrant=True`, monitor gradient norms closely and consider disabling it to recover convergence.
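The recommendation above can be sketched as a small settings helper. This is a minimal illustration that assumes a Hugging Face Transformers style setup, where `gradient_checkpointing_kwargs` is passed to `TrainingArguments` and `use_cache` lives on the model config. The function name `checkpointing_settings` and the returned dictionary layout are hypothetical.

```python
def checkpointing_settings(activation_checkpointing: bool,
                           use_reentrant: bool = False) -> dict:
    """Hypothetical helper: collect the settings this pattern recommends.

    A value of None means "leave the framework default untouched".
    """
    settings = {
        "gradient_checkpointing": activation_checkpointing,
        "gradient_checkpointing_kwargs": None,
        "use_cache": None,
    }
    if activation_checkpointing:
        # Prefer the non-reentrant checkpoint implementation.
        settings["gradient_checkpointing_kwargs"] = {"use_reentrant": use_reentrant}
        # The KV cache is disabled while activations are being recomputed.
        settings["use_cache"] = False
    return settings


settings = checkpointing_settings(activation_checkpointing=True)

# Assumed wiring in a Transformers training script (not executed here):
# from transformers import TrainingArguments
# args = TrainingArguments(
#     output_dir="out",
#     gradient_checkpointing=settings["gradient_checkpointing"],
#     gradient_checkpointing_kwargs=settings["gradient_checkpointing_kwargs"],
# )
# model.config.use_cache = settings["use_cache"]
```

Keeping the three values in one place makes it harder to enable checkpointing while forgetting `use_cache=False`, which is the combination this pattern warns about.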
