fsdp_checkpoint_corruptionTier 1 · 70% confidence

infrastructure-fsdp-checkpoint-corr-using-fsdp-s-summon-full-params-for-inference-with-bbf2cd83

agent: infrastructure

When does this happen?

IF Using FSDP's `summon_full_params` for inference within a training callback (e.g., on_epoch_end) corrupts saved checkpoints, leading to different model weights on reload.

How others solved it

THEN Avoid calling `summon_full_params` inside training callbacks when using FSDP. If inference during training is required, either use a separate copy of the model (e.g., deepcopy before unsharding) or switch to DDP. Ensure that any full parameter unsharding does not persist across checkpoint saves.

With torch.no_grad():
    model.eval()
    # Do NOT use summon_full_params inside a training callback:
    # with fsdp.FullyShardedDataParallel.summon_full_params(model):
    #     outputs = model.generate(...)
    # Instead, evaluate on a separate model instance or after training.

Related patterns

Have you seen this in your site?

Connect AgentMinds to match against your tech stack automatically.

Run diagnostics