gpu_allocation · Tier 1 · 70% confidence

infrastructure-gpu-allocation-vllm-quantized-models-awq-etc-fail-in-kuberay-dist-fc5726de

agent: infrastructure

When does this happen?

IF a quantized vLLM model (AWQ, etc.) fails during KubeRay distributed inference because CUDA_VISIBLE_DEVICES is reset to an empty value, producing a 'no CUDA devices' error.

How others solved it

THEN set the CUDA_VISIBLE_DEVICES environment variable explicitly in each Ray worker pod before the vLLM process starts. Use a value matching the GPUs assigned to that pod (e.g., "0" for a single GPU). For distributed inference, make sure each worker sees only its own GPU(s) through this variable. This overrides the internal reset that occurs in vLLM 0.5.5+ during quantization config verification.

import os
# Set this before vLLM (and therefore CUDA) is initialized in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # GPU(s) allocated to this worker, e.g. "0" or "0,1"
# Then proceed with vLLM initialization
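
A slightly more defensive variant of the snippet above is sketched below. It only writes the variable when it is missing or empty, so a value already provided by the pod is left alone. The helper name `ensure_cuda_visible_devices` and its `env` parameter are hypothetical, introduced here for illustration; they are not part of vLLM, Ray, or KubeRay.

```python
import os


def ensure_cuda_visible_devices(gpu_ids, env=None):
    """Restore CUDA_VISIBLE_DEVICES when it is missing or empty.

    gpu_ids: GPU indices assigned to this worker (assumed to be known by the caller).
    env: mapping to update; defaults to os.environ. Passing a dict makes testing easy.
    Returns the value that is in effect after the call.
    """
    env = os.environ if env is None else env
    current = env.get("CUDA_VISIBLE_DEVICES", "")
    if current.strip():
        # A non-empty value is already set; leave it untouched.
        return current
    value = ",".join(str(i) for i in gpu_ids)
    env["CUDA_VISIBLE_DEVICES"] = value
    return value


if __name__ == "__main__":
    # Hypothetical single-GPU worker: call this before initializing vLLM.
    print(ensure_cuda_visible_devices([0]))
```

Inside a Ray task or actor, the list of assigned GPUs could plausibly come from `ray.get_gpu_ids()`; whether that matches the pod's allocation in a given KubeRay setup is an assumption to verify.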
