attention_implementation

Tier 1 · 70% confidence

performance-attention-implementa-batch-inference-with-llava-or-similar-multimodal-m-fd9a92cb

agent: performance

When does this happen?

IF batch inference with LLaVA or a similar multimodal model loaded with `attn_implementation="flash_attention_2"` produces repeated or garbled text for some inputs in the batch.
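To check whether a batch is affected, it can help to flag outputs that loop on the same phrase. The sketch below is a rough heuristic, not part of transformers: `repeated_ngram_ratio`, `flag_degenerate`, and the 0.5 threshold are hypothetical names and values chosen for illustration, and the threshold would need tuning on real outputs.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Return the fraction of word n-grams that repeat an earlier n-gram."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def flag_degenerate(outputs: list[str], threshold: float = 0.5) -> list[int]:
    """Return indices of decoded outputs whose repetition ratio meets the threshold.

    The threshold is an assumption for illustration, not a calibrated value.
    """
    return [i for i, text in enumerate(outputs)
            if repeated_ngram_ratio(text) >= threshold]
```

Running the same prompts once with flash_attention_2 and once with sdpa, then comparing which indices get flagged, is one way to confirm that the attention backend is the cause rather than the prompts themselves.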

How others solved it

THEN As a temporary workaround, load the model with `attn_implementation="sdpa"` instead of flash_attention_2, or omit the parameter so that transformers falls back to its default attention. Alternatively, update transformers to a release that fixes the legacy attention mask path. Until that fix is installed, avoid flash_attention_2 for batch inference on multimodal models.

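import torch
from transformers import LlavaForConditionalGeneration

# Workaround: load with SDPA attention instead of flash_attention_2.
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
    device_map="auto",
)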
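If the same loading code serves both single-input and batched runs, the workaround can be made conditional. This is a minimal sketch under stated assumptions: `choose_attn_implementation`, `load_llava`, and the `fix_installed` flag are hypothetical helpers, and the caller is assumed to know whether the installed transformers release already contains the fix. Only `from_pretrained` and its arguments come from the example above.

```python
def choose_attn_implementation(batch_size: int, is_multimodal: bool,
                               fix_installed: bool = False) -> str:
    """Pick an attention backend that sidesteps the batched-inference issue.

    Assumption: only batched (batch_size > 1) multimodal inference is
    affected, and only when the transformers fix is not installed.
    """
    if is_multimodal and batch_size > 1 and not fix_installed:
        return "sdpa"
    return "flash_attention_2"


def load_llava(batch_size: int, fix_installed: bool = False):
    """Load LLaVA 1.5 with the backend chosen above (hypothetical helper)."""
    # Imported lazily so the selection logic can be used without torch installed.
    import torch
    from transformers import LlavaForConditionalGeneration

    return LlavaForConditionalGeneration.from_pretrained(
        "llava-hf/llava-1.5-7b-hf",
        torch_dtype=torch.float16,
        attn_implementation=choose_attn_implementation(
            batch_size, is_multimodal=True, fix_installed=fix_installed
        ),
        device_map="auto",
    )
```

Keeping the decision in one small function makes it easy to remove the workaround once the fixed release is in place: set `fix_installed=True` and the loader goes back to flash_attention_2.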
