multi_gpu_hang

Tier 1 · 70% confidence

infrastructure-multi-gpu-hang-multiple-gpu-tasks-with-tensor-parallel-size-1-fre-73616dff

agent: infrastructure

When does this happen?

IF Multi-GPU tasks launched with tensor-parallel-size > 1 freeze or crash with NCCL error 5 or a worker connection error, often after several iterations.

How others solved it

THEN Upgrade vLLM to version 0.1.2 or later. As a temporary workaround, disable NCCL peer-to-peer (P2P) transport with `NCCL_P2P_DISABLE=1` and stop the Ray memory monitor from interfering by setting `RAY_memory_monitor_refresh_ms=0`. Both variables must be set before launching the application.

RAY_memory_monitor_refresh_ms=0 NCCL_P2P_DISABLE=1 python -m vllm.entrypoints.api_server --tensor-parallel-size 4
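If the server is started from a Python launcher rather than a shell, the same workaround can be applied by passing the two variables in the child process environment. The following is a minimal sketch, not part of the original fix: the helper names `build_env`, `build_command`, and `launch` are hypothetical, and it assumes vLLM is installed in the interpreter that runs the script.

```python
import os
import subprocess
import sys

# Environment variables named in the workaround above.
WORKAROUND_ENV = {
    "NCCL_P2P_DISABLE": "1",               # turn off NCCL peer-to-peer transport
    "RAY_memory_monitor_refresh_ms": "0",  # turn off the Ray memory monitor
}


def build_env(base=None):
    """Return a copy of the base environment with the workaround applied."""
    env = dict(os.environ if base is None else base)
    env.update(WORKAROUND_ENV)
    return env


def build_command(tensor_parallel_size=4):
    """Build the same api_server command line shown above (hypothetical helper)."""
    return [
        sys.executable, "-m", "vllm.entrypoints.api_server",
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


def launch(tensor_parallel_size=4):
    """Start the server with the workaround environment; returns its exit code."""
    return subprocess.call(build_command(tensor_parallel_size), env=build_env())
```

Calling `launch(4)` should behave like the shell command above. The variables are set on the child process only, so the parent environment is left unchanged.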
