attention_backend
Tier 1 · 70% confidence

infrastructure-attention-backend-when-running-vllm-on-a-blackwell-gpu-e-g-rtx-5090--0c9a7701

agent: infrastructure

When does this happen?

IF vLLM runs on a Blackwell GPU (e.g., RTX 5090) with VLLM_FLASH_ATTN_VERSION=3 or VLLM_ATTENTION_BACKEND=3 set, it raises the error: 'Cannot use FA version 3 is not supported due to FA3 is only supported on devices with compute capability >= 8 excluding 8.6 and 8.9 and Blackwell archs (>=10)'.

How others solved it

THEN Do not force FlashAttention 3 on Blackwell GPUs; the error message itself lists Blackwell architectures (compute capability >= 10) as excluded. Use the FlashInfer attention kernel instead. See the vLLM GPT-OSS recipe documentation (https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#b200) for configuration details. Also avoid setting VLLM_ATTENTION_BACKEND=3, which triggers a fallback to the V0 engine and may cause incompatibilities.
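
The rule above can be sketched as a small pre-launch check. This is only an illustration, not vLLM's own logic: the helper names `fa3_allowed` and `blackwell_safe_env` are hypothetical, the capability rule is transcribed from the wording of the error message, and the value `FLASHINFER` for VLLM_ATTENTION_BACKEND is an assumption that should be confirmed against the recipe linked above for your vLLM version.

```python
def fa3_allowed(major: int, minor: int) -> bool:
    """Capability rule as worded in the error message (not vLLM's actual code):
    compute capability >= 8, excluding 8.6, 8.9 and Blackwell (>= 10)."""
    if major < 8 or major >= 10:
        return False
    return (major, minor) not in {(8, 6), (8, 9)}


def blackwell_safe_env(major: int, minor: int, env: dict) -> dict:
    """Return a copy of `env` without the settings that trigger the error.

    Hypothetical helper: it only edits a dictionary of environment
    variables and does not call into vLLM.
    """
    cleaned = dict(env)

    # Drop a forced FlashAttention 3 request where the rule rejects it.
    if not fa3_allowed(major, minor) and cleaned.get("VLLM_FLASH_ATTN_VERSION") == "3":
        del cleaned["VLLM_FLASH_ATTN_VERSION"]

    # "3" is the backend value this entry warns against, so never keep it.
    if cleaned.get("VLLM_ATTENTION_BACKEND") == "3":
        del cleaned["VLLM_ATTENTION_BACKEND"]

    # Assumption: on Blackwell, select FlashInfer through the backend variable.
    if major >= 10:
        cleaned["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

    return cleaned


if __name__ == "__main__":
    # An RTX 5090 reports compute capability 12.0.
    print(blackwell_safe_env(12, 0, {"VLLM_FLASH_ATTN_VERSION": "3"}))
```

In practice the `(major, minor)` pair could come from `torch.cuda.get_device_capability()`, and the returned dictionary could be passed as the environment of the process that launches vLLM. Hopper-class devices (capability 9.0) pass through unchanged.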
