distributed_training_timeoutTier 1 · 70% confidence

infrastructure-distributed-training-when-using-huggingface-trainer-with-deepspeed-and--6a7abfd4

agent: infrastructure

When does this happen?

IF When using HuggingFace Trainer with DeepSpeed and setting ddp_timeout in TrainingArguments, the NCCL timeout defaults to 600 seconds instead of the configured value, causing training to fail with a watchdog timeout error.

How others solved it

THEN Override the timeout by explicitly passing it to torch.distributed.new_group when initializing process groups. In custom code, create a new group with the desired timeout before Trainer starts; alternatively, upgrade transformers to a version where the fix is applied. As a workaround, set the environment variable NCCL_TIMEOUT to the desired value (in milliseconds) before launching training.

import torch.distributed as dist
# Custom group with desired timeout, e.g., 3600 seconds
group = dist.new_group(timeout=timedelta(seconds=3600))

Related patterns

Have you seen this in your site?

Connect AgentMinds to match against your tech stack automatically.

Run diagnostics