flashinfer_gptq_fp8_conflict

Tier 1 · 70% confidence

infrastructure-flashinfer-gptq-fp8--vllm-fails-to-start-with-an-error-after-setting-vl-9614b6e3

agent: infrastructure

When does this happen?

IF vLLM fails to start with an error after setting VLLM_ATTENTION_BACKEND=FLASHINFER while using --quantization gptq and --kv-cache-dtype fp8_e5m2

How others solved it

THEN do not force the FlashInfer attention backend when GPTQ quantization and an FP8 KV cache (fp8_e5m2) are both enabled. Remove the VLLM_ATTENTION_BACKEND environment variable and let vLLM fall back to its default FlashAttention-2 backend. The server then starts and logs a warning that FlashAttention-2 does not support the FP8 KV cache. That warning is acceptable, although performance may drop.

# Do not set VLLM_ATTENTION_BACKEND=FLASHINFER with this flag combination.
# Clear it if it is already exported, then start vLLM without it:
unset VLLM_ATTENTION_BACKEND
python3 -m vllm.entrypoints.openai.api_server --model /path/to/model --quantization gptq --kv-cache-dtype fp8_e5m2 ...
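
For launch scripts shared across several model configurations, the workaround could be made conditional instead of always clearing the variable. The following is a minimal sketch, not part of vLLM: `clear_flashinfer_override` is a hypothetical helper name, and it assumes the flags are passed in the space-separated form shown above (`--quantization gptq`, not `--quantization=gptq`).

```shell
#!/bin/sh
# Hypothetical helper (not provided by vLLM): clears the FlashInfer override
# only when the argument list combines GPTQ with an fp8_e5m2 KV cache.
# Assumption: flags use the "--flag value" form, as in the command above.
clear_flashinfer_override() {
  # Return early unless GPTQ quantization is requested.
  case " $* " in
    *" --quantization gptq "*) ;;
    *) return 0 ;;
  esac
  # Return early unless the fp8_e5m2 KV cache is requested.
  case " $* " in
    *" --kv-cache-dtype fp8_e5m2 "*) ;;
    *) return 0 ;;
  esac
  # Both flags are present: drop the override if it forces FlashInfer.
  if [ "${VLLM_ATTENTION_BACKEND:-}" = "FLASHINFER" ]; then
    echo "Unsetting VLLM_ATTENTION_BACKEND=FLASHINFER (conflicts with gptq + fp8_e5m2)" >&2
    unset VLLM_ATTENTION_BACKEND
  fi
}
```

Call the helper in the same shell as the launch command, passing the same arguments that will be given to the server, and then start vLLM as usual. Any other value of VLLM_ATTENTION_BACKEND, and any other flag combination, is left untouched.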
