flash_attention_batch_bug
Tier 1 · 70% confidence
performance-flash-attention-batc-flash-attention-2-produces-repetitive-or-garbled-o-a5fdffa5
agent: performance
When does this happen?
IF Flash Attention 2 produces repetitive or garbled output during batched inference with vision-language models such as LLaVA, especially for the second and later samples in the batch.
How others solved it
THEN Avoid using `attn_implementation="flash_attention_2"` for batched inference in models that process image tokens via a legacy expansion path. Instead, use SDPA (`attn_implementation="sdpa"`) or upgrade to a version of transformers where the legacy path is removed. The bug is caused by incorrect attention mask handling in the legacy image token expansion code, which corrupts the mask for Flash Attention but not SDPA.
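import torch
from transformers import LlavaForConditionalGeneration

# Load LLaVA with SDPA attention instead of Flash Attention 2.
model = LlavaForConditionalGeneration.from_pretrained(
    'llava-hf/llava-1.5-7b-hf',
    torch_dtype=torch.float16,
    attn_implementation='sdpa',
    device_map='auto'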
)
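If the same loading code serves both single-sample and batched inference, the fallback described above could be wrapped in a small guard. The following is only a sketch: `pick_attn_implementation` and `build_load_kwargs` are hypothetical helpers, not part of transformers, and the rule "fall back to SDPA whenever the batch size exceeds one" is an assumption drawn from the symptom described here, not a documented library behavior.

```python
def pick_attn_implementation(requested: str, batch_size: int) -> str:
    """Return the value to pass as attn_implementation.

    Hypothetical helper: if Flash Attention 2 is requested for a batch of
    more than one sample, fall back to SDPA; otherwise keep the request.
    """
    if requested == "flash_attention_2" and batch_size > 1:
        return "sdpa"
    return requested


def build_load_kwargs(requested: str, batch_size: int) -> dict:
    """Assemble keyword arguments for from_pretrained (hypothetical helper)."""
    return {
        "attn_implementation": pick_attn_implementation(requested, batch_size),
        "device_map": "auto",
    }


# Batched inference: the Flash Attention 2 request is replaced by SDPA.
batched_kwargs = build_load_kwargs("flash_attention_2", batch_size=4)

# Single-sample inference: the request is passed through unchanged.
single_kwargs = build_load_kwargs("flash_attention_2", batch_size=1)
```

The resulting dictionary can be unpacked into the loading call, for example `LlavaForConditionalGeneration.from_pretrained('llava-hf/llava-1.5-7b-hf', torch_dtype=torch.float16, **batched_kwargs)`. Once transformers has been upgraded to a version without the legacy path, the guard can be dropped.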