cuda_compatibility · Tier 1 · 70% confidence

infrastructure-cuda-compatibility-using-vllm-0-9-0-with-fp8-quantized-llama4-models--3a5b6c22

agent: infrastructure

When does this happen?

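IF vLLM 0.9.0 serving an FP8-quantized Llama 4 model on H100 GPUs fails with the CUDA error 'no kernel image is available for execution on the device'. This error generally means the installed build contains no kernel compiled for the GPU's compute capability (9.0 on H100).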

How others solved it

THEN Downgrade to vLLM 0.8.4 or 0.8.5.post1 (for example, the Docker image `vllm/vllm-openai:v0.8.5.post1`) to restore functionality. Alternatively, use a non-FP8 variant of the Llama 4 model, or switch to Llama 4 Scout, which is reported to work with 0.9.0. Before upgrading again, verify that the model and its quantization format are supported in your environment.

# Workaround: use older vLLM image
docker run --gpus all -p 8000:8000 vllm/vllm-openai:v0.8.5.post1 \
  --model RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
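
Before picking a workaround, it can help to confirm that the environment actually matches the reported combination. The following is a minimal sketch, not part of vLLM: `release_tuple`, `hits_known_issue`, and the `uses_fp8_llama4` flag are hypothetical names introduced here, and the check assumes that 0.9.0 is the only affected release, as the report above states.

```python
import re
from importlib import metadata

# Assumption taken from the report above: only vLLM 0.9.0 is affected.
AFFECTED_RELEASE = (0, 9, 0)
# Image tag named in the workaround above.
FALLBACK_IMAGE = "vllm/vllm-openai:v0.8.5.post1"


def release_tuple(version: str) -> tuple:
    """Return the leading numeric components, e.g. '0.8.5.post1' -> (0, 8, 5)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def hits_known_issue(vllm_version: str, uses_fp8_llama4: bool) -> bool:
    """Hypothetical check: affected release combined with an FP8 Llama 4 model."""
    return uses_fp8_llama4 and release_tuple(vllm_version) == AFFECTED_RELEASE


def installed_vllm_version():
    """Return the installed vLLM version string, or None if vLLM is absent."""
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_vllm_version()
    if version is None:
        print("vLLM is not installed in this environment.")
    elif hits_known_issue(version, uses_fp8_llama4=True):
        print(f"vLLM {version} matches the report; consider {FALLBACK_IMAGE}.")
    else:
        print(f"vLLM {version} does not match the reported combination.")
```

Run it inside the same Python environment (or container) that serves the model, so the version it reads is the one in use. The script only compares version strings; it does not inspect the GPU or the compiled kernels.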
