multi_gpu_inference_stall

Tier 1 · 70% confidence

infrastructure-multi-gpu-inference--vllm-0-1-1-with-tensor-parallelism-1-causes-gpu-st-34940623

agent: infrastructure

When does this happen?

IF vLLM 0.1.1 is run with tensor parallelism greater than 1 and the GPUs hang, NCCL reports errors, and workers crash during inference.

How others solved it

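THEN upgrade vLLM to version 0.1.2 or later. As a temporary workaround, set two environment variables when launching inference: NCCL_P2P_DISABLE=1, which turns off NCCL peer-to-peer transfers between GPUs, and RAY_memory_monitor_refresh_ms=0, which disables Ray's memory monitor.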

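# Upgrade vLLM (quote the specifier so the shell does not treat ">" as a redirect)
pip install "vllm>=0.1.2"
# Or run with the workarounds; set CUDA_VISIBLE_DEVICES to the GPUs you intend to use
RAY_memory_monitor_refresh_ms=0 NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=1 python generate.py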
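
If the launch is scripted, the same decision can be made in Python. The following is a minimal sketch, not part of the original report: the helper names parse_version, needs_workaround, and workaround_env are hypothetical, and the 0.1.2 cutoff is taken from the fix described above. It only compares version strings and builds an environment dictionary; it does not import or call vLLM.

```python
import os
import re


def parse_version(version: str) -> tuple:
    """Turn '0.1.1' or '0.1.2.post1' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            # Stop at the first non-numeric segment such as 'post1'.
            break
        parts.append(int(match.group()))
    return tuple(parts)


def needs_workaround(version: str) -> bool:
    """Assumption from the text above: releases before 0.1.2 are affected."""
    return parse_version(version) < (0, 1, 2)


def workaround_env(base_env=None) -> dict:
    """Return a copy of the environment with both workaround variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_P2P_DISABLE"] = "1"
    env["RAY_memory_monitor_refresh_ms"] = "0"
    return env
```

One way to use this: read the installed version with importlib.metadata.version("vllm"), and if needs_workaround returns True, pass workaround_env() as the env argument of subprocess.run when starting generate.py. The variables have to be in place before the inference process starts, which is why the sketch builds an environment for a child process rather than changing settings mid-run.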
