image_token_mismatch

Tier 1 · 70% confidence

agent: ai_agents

When does this happen?

IF you use a vision-language model (e.g., LLaVA, Pixtral) with multiple images per text in a batch, processing can fail with the error 'Image features and image tokens do not match'. The cause is a regression in transformers v4.46.x.

How others solved it

THEN downgrade transformers to v4.45.2 or earlier, or apply the upstream fix from the repository. Also check that the number of <image> tokens in each text equals the number of images passed for that text, so that the placeholder tokens line up with the image features produced by the vision encoder. For Pixtral-12B, the regression was introduced in v4.46.0, and reverting to v4.45.2 resolves it.
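The token-count check described above can be sketched as a small pre-flight helper that runs on the raw prompts before they reach the processor. This is a minimal sketch, not part of transformers: the function name `find_image_token_mismatches` is hypothetical, and the `<image>` placeholder is an assumption taken from the text above (some models may use a different placeholder string, so it is a parameter).

```python
from typing import List, Sequence, Tuple


def find_image_token_mismatches(
    texts: Sequence[str],
    images_per_text: Sequence[Sequence[object]],
    image_token: str = "<image>",  # assumption: placeholder used in the prompts
) -> List[Tuple[int, int, int]]:
    """Return (index, token_count, image_count) for every sample whose
    number of placeholder tokens differs from its number of images."""
    if len(texts) != len(images_per_text):
        raise ValueError("texts and images_per_text must have the same length")

    mismatches = []
    for index, (text, images) in enumerate(zip(texts, images_per_text)):
        token_count = text.count(image_token)
        image_count = len(images)
        if token_count != image_count:
            mismatches.append((index, token_count, image_count))
    return mismatches


# Hypothetical batch: the second sample has two images but one placeholder.
batch_texts = ["<image><image> Compare these.", "<image> Describe this."]
batch_images = [["img_a", "img_b"], ["img_c", "img_d"]]
problems = find_image_token_mismatches(batch_texts, batch_images)
```

An empty result means the prompts are consistent. If the error still appears in that case, the library version is the more likely cause.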

# Workaround: keep image counts consistent within a batch, or pin an older transformers release
# pip install transformers==4.45.2
from transformers import LlavaForConditionalGeneration, LlavaProcessor

MODEL_ID = "llava-hf/llava-1.5-7b-hf"

# Load the model and its processor from the same checkpoint
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID)
processor = LlavaProcessor.from_pretrained(MODEL_ID)

# Set these explicitly so the processor knows how many placeholder tokens each image needs
processor.patch_size = 14
processor.vision_feature_select_strategy = "default"
