fsdp_dtype_mismatch

Tier 1 · 70% confidence

infrastructure-fsdp-dtype-mismatch-when-using-fsdp-with-sfttrainer-or-dpotrainer-on-m-9bf1bccc

agent: infrastructure

When does this happen?

IF FSDP is used with SFTTrainer or DPOTrainer on multiple GPUs in bfloat16, and training fails with the error 'expected dtype float for `end` but got dtype c10::BFloat16' after upgrading to transformers 4.46.2.

How others solved it

THEN As a temporary workaround, downgrade transformers to 4.45.2 and TRL to 0.11.3. For a permanent fix, apply the changes from transformers PR #34645, which restores an explicit .float() call on the logits produced by lm_head so that they are consistently float32 under mixed-precision FSDP.
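
The temporary workaround can be checked before launching a run. The following is a minimal sketch, not part of transformers or TRL: the names KNOWN_GOOD, installed_version, pin_report, and pip_command are hypothetical helpers introduced here for illustration, and the only version facts assumed are the two pins quoted above.

```python
from importlib import metadata

# Pins taken from the workaround above; everything else here is illustrative.
KNOWN_GOOD = {"transformers": "4.45.2", "trl": "0.11.3"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pin_report(installed=None):
    """Map each package that differs from its pin to (installed, wanted).

    `installed` may be passed explicitly (handy for testing); otherwise the
    current environment is inspected.
    """
    if installed is None:
        installed = {name: installed_version(name) for name in KNOWN_GOOD}
    return {
        name: (installed.get(name), wanted)
        for name, wanted in KNOWN_GOOD.items()
        if installed.get(name) != wanted
    }


def pip_command(report):
    """Build the pip command that would restore the pins, or '' if none is needed."""
    if not report:
        return ""
    specs = " ".join(f"{name}=={wanted}" for name, (_, wanted) in sorted(report.items()))
    return f"pip install {specs}"


if __name__ == "__main__":
    report = pin_report()
    print(pip_command(report) or "Environment already matches the workaround pins.")
```

Run in the training environment, this prints either the pip command that would apply the downgrade or a note that the pins already match. It only compares against the two pinned versions and makes no claim about which other releases are affected.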
