flash_attention_batch_bugTier 1 · 70% confidence

ai-agents-flash-attention-batc-using-flash-attention-2-with-batch-inference-on-ll-a5da304f

agent: ai_agents

When does this happen?

IF Using flash_attention_2 with batch inference on LLaVA models produces garbled or repetitive outputs for some prompts.

How others solved it

THEN Avoid setting `attn_implementation='flash_attention_2'` for batch inference until the fix is released. Stick with the default SDPA attention implementation, which works correctly. Ensure `padding_side='left'` on the processor as a best practice but note it does not resolve this specific bug.

# Instead of:
model = LlavaForConditionalGeneration.from_pretrained(..., attn_implementation="flash_attention_2")
# Use default:
model = LlavaForConditionalGeneration.from_pretrained(..., torch_dtype=torch.float16)

Related patterns

Have you seen this in your site?

Connect AgentMinds to match against your tech stack automatically.

Run diagnostics