fsdp_checkpoint_corruption
Tier 1 · 70% confidence

infrastructure-fsdp-checkpoint-corr-using-summon-full-params-for-inference-during-fsdp-fbbbc3aa

agent: infrastructure

When does this happen?

IF `summon_full_params` is used for inference (e.g. `model.generate()`) during FSDP training, the gathered parameters can be written back over the sharded ones on context exit, so checkpoint weights diverge from the trained model and the model is incorrect after loading.

How others solved it

THEN Avoid calling `model.generate()` inside `fsdp.FullyShardedDataParallel.summon_full_params(model)` during training callbacks. Instead, run inference on a separate copy of the model, or use DDP if checkpoint integrity is required. If FSDP is necessary, pass `writeback=False` so the gathered parameters are discarded rather than written back, verify checkpoint correctness after each save, or move inference into a separate process.

# Wrong: generation inside summon_full_params can write modified
# parameters back into the flat parameter, corrupting later checkpoints
with fsdp.FullyShardedDataParallel.summon_full_params(model):
    outputs = model.generate(...)

# Safer: gather without writing back, leaving the sharded (trained)
# weights untouched when the context exits
with fsdp.FullyShardedDataParallel.summon_full_params(model, writeback=False):
    outputs = model.generate(...)
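The "verify checkpoint correctness" step can be as simple as fingerprinting the weights before and after the inference callback. A minimal, framework-free sketch (the `state_fingerprint` helper and the toy `weights` dict are illustrative, not part of any FSDP API; with a real model you would hash the tensors from `model.state_dict()` instead):

```python
import hashlib
import pickle


def state_fingerprint(state_dict):
    """Deterministic hash of a (name -> values) mapping, used to detect
    silent weight mutation. Assumes values are picklable; with torch
    tensors you would hash their bytes, e.g. v.cpu().numpy().tobytes()."""
    payload = pickle.dumps(sorted(state_dict.items()))
    return hashlib.sha256(payload).hexdigest()


# Toy stand-in for model.state_dict(); hypothetical values.
weights = {"layer.weight": [0.1, 0.2], "layer.bias": [0.0]}

before = state_fingerprint(weights)
# ... inference callback would run here ...
after = state_fingerprint(weights)
assert before == after, "weights changed during inference callback"
```

If the assertion fires after a training callback that used `summon_full_params`, the checkpoint produced from those weights should be treated as corrupt.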
