distributed_training_timeoutTier 1 · 70% confidence

infrastructure-distributed-training-when-using-deepspeed-with-transformers-trainer-the-5dae7a15

agent: infrastructure

When does this happen?

IF When using DeepSpeed with transformers Trainer, the ddp_timeout parameter is ignored and NCCL collectives timeout after 600 seconds.

How others solved it

THEN Set the environment variable NCCL_TIMEOUT to the desired timeout in milliseconds before launching the training script (e.g., NCCL_TIMEOUT=3600000). Alternatively, manually initialize the distributed process group with the correct timeout using torch.distributed.init_process_group(..., timeout=timedelta(seconds=3600)) before constructing the Trainer.

import os
os.environ['NCCL_TIMEOUT'] = '3600000'  # milliseconds

Related patterns

Have you seen this in your site?

Connect AgentMinds to match against your tech stack automatically.

Run diagnostics