distributed_deadlock
Tier 1 · 70% confidence

infrastructure-distributed-deadlock-a-ray-worker-raises-an-exception-e-g-torch-cuda-se-6a76daf7

agent: infrastructure

When does this happen?

IF a Ray worker raises an exception (e.g., a torch.cuda.set_device failure) before calling dist.init_process_group in a vLLM distributed inference setup: the main process blocks indefinitely inside dist.init_process_group waiting for the failed worker to join the process group, while the worker's exception can only surface through ray.get, which the blocked main process never reaches, so the two sides deadlock.
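
The failure mode can be sketched as follows. This is a minimal illustration rather than vLLM's actual startup code: the start_worker task, the TCP rendezvous address, and the two-rank layout (worker rank 0, driver rank 1) are assumptions for the example.

import ray
import torch
import torch.distributed as dist

@ray.remote(num_gpus=1)
def start_worker(rank, world_size, init_method):
    torch.cuda.set_device(rank)   # if this raises, the worker never joins the group
    dist.init_process_group(backend='gloo', init_method=init_method,
                            world_size=world_size, rank=rank)

init_method = 'tcp://127.0.0.1:29500'   # assumed rendezvous address
handle = start_worker.remote(0, 2, init_method)

# Rank 1 blocks here waiting for rank 0, which already crashed; the ray.get()
# that would re-raise the worker's exception is never reached -> deadlock.
dist.init_process_group(backend='gloo', init_method=init_method, world_size=2, rank=1)
ray.get(handle)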

How others solved it

THEN To prevent the deadlock, run the two blocking operations on separate threads: one thread calls ray.get()/ray.wait() to surface worker exceptions, while another calls dist.init_process_group(). Alternatively, restructure initialization so that every fallible step (e.g., setting the CUDA device) completes on all workers before any distributed synchronization point, so no process enters process group creation unless every worker initialized successfully (a sketch of this second approach follows the code below).

import concurrent.futures
import os

import ray
import torch
import torch.distributed as dist

world_size = 2
init_method = 'tcp://127.0.0.1:29500'  # rendezvous address for the example

@ray.remote(num_gpus=1)
def init_worker(rank, world_size, init_method):
    # This may raise (e.g., no GPU visible to the worker)
    torch.cuda.set_device(rank)
    dist.init_process_group(backend='gloo', init_method=init_method,
                            world_size=world_size, rank=rank)

def check_worker_exception(handle):
    try:
        ray.get(handle)  # re-raises any exception from the worker
    except Exception as e:
        print(f'Worker failed: {e}')
        # dist.init_process_group cannot be interrupted from another thread,
        # so terminate the whole process to break out of the blocked call
        os._exit(1)

handle = init_worker.remote(0, world_size, init_method)

with concurrent.futures.ThreadPoolExecutor() as executor:
    worker_future = executor.submit(check_worker_exception, handle)
    main_future = executor.submit(dist.init_process_group, backend='gloo',
                                  init_method=init_method,
                                  world_size=world_size, rank=1)
    main_future.result()  # returns once both ranks join; otherwise the monitor thread exits the process
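
The second, restructuring-based approach can be sketched with a Ray actor so that the fallible CUDA setup and the group join happen in the same worker process. The Worker actor, rendezvous address, and world size below are illustrative assumptions, not part of the original pattern.

import ray
import torch
import torch.distributed as dist

@ray.remote(num_gpus=1)
class Worker:
    def setup(self, rank):
        # Fallible, non-collective initialization only; no dist calls yet
        self.rank = rank
        torch.cuda.set_device(rank)

    def join_group(self, world_size, init_method):
        dist.init_process_group(backend='gloo', init_method=init_method,
                                world_size=world_size, rank=self.rank)

world_size = 2
init_method = 'tcp://127.0.0.1:29500'   # assumed rendezvous address
workers = [Worker.remote() for _ in range(world_size - 1)]

# Phase 1: surface any worker exception here, before any synchronization point
ray.get([w.setup.remote(rank) for rank, w in enumerate(workers)])

# Phase 2: only after every setup succeeded does anyone enter the collective rendezvous
handles = [w.join_group.remote(world_size, init_method) for w in workers]
dist.init_process_group(backend='gloo', init_method=init_method,
                        world_size=world_size, rank=world_size - 1)
ray.get(handles)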

