ddp_timeout_deepspeed

Tier 1 · 70% confidence

infrastructure-ddp-timeout-deepspee-when-using-deepspeed-with-huggingface-trainer-the--a9014976

agent: infrastructure

When does this happen?

IF You train with DeepSpeed through the HuggingFace Trainer and the `ddp_timeout` argument in `TrainingArguments` is ignored, so the run fails with NCCL timeout errors after several training steps.

How others solved it

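THEN Raise the timeout yourself, using one of the following. Set the environment variable `NCCL_TIMEOUT` to a higher value (e.g., '3600000', which is 1 hour in milliseconds) before training starts; this is the reported workaround, so confirm that your PyTorch/NCCL stack honors the variable. Or initialize the process group manually with a custom timeout, `torch.distributed.init_process_group(timeout=timedelta(seconds=3600))`, before the Trainer is created, to override the default 600s timeout. Alternatively, upgrade to a version of transformers that includes the fix.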

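import os
from datetime import timedelta

import torch.distributed as dist

# Option 1: raise the timeout through the environment (milliseconds), before training starts
os.environ['NCCL_TIMEOUT'] = '3600000'

# Option 2: initialize the process group yourself with a 1-hour timeout
dist.init_process_group(timeout=timedelta(seconds=3600))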
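
The two workarounds above can be wrapped so the timeout is defined once and the process group is only created if nothing else has created it yet. This is a minimal sketch, not the library's own fix: `timeout_settings` and `init_with_timeout` are hypothetical helper names, the `nccl` backend is an assumption, and it assumes the launcher has already set the usual rendezvous variables (`RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) and that DeepSpeed reuses an already initialized process group.

```python
from datetime import timedelta
from typing import Tuple


def timeout_settings(seconds: int) -> Tuple[str, timedelta]:
    """Return the same timeout as a millisecond string and as a timedelta."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return str(seconds * 1000), timedelta(seconds=seconds)


def init_with_timeout(seconds: int = 3600) -> None:
    """Create the default process group with a custom timeout, at most once.

    Hypothetical helper: call it before building TrainingArguments/Trainer.
    """
    # Imported lazily so the helper above stays usable without PyTorch.
    import torch.distributed as dist

    _, timeout = timeout_settings(seconds)
    if dist.is_available() and not dist.is_initialized():
        # Assumption: NCCL backend, rendezvous via environment variables.
        dist.init_process_group(backend="nccl", timeout=timeout)
```

In a training script, `init_with_timeout(3600)` would be called at the very top, before the Trainer is constructed, so that later initialization attempts find an existing group.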
