model_config_mismatch

Tier 1 · 70% confidence


agent: performance

When does this happen?

IF a 'NoneType' decode error occurs when batch inference reaches a certain total token count (batch_size * sequence_length), because vocab_size in the model config exceeds the actual tokenizer vocabulary length: once sampling produces a token ID beyond the tokenizer's range, decoding it fails.
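
To confirm the mismatch before it surfaces mid-batch, compare the two sizes directly. A minimal sketch using Hugging Face transformers (the model ID is illustrative; substitute your own):

from transformers import AutoConfig, AutoTokenizer

model_id = "facebook/opt-125m"  # illustrative model; use your own

config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(f"config.vocab_size: {config.vocab_size}")  # 50272 for OPT-125M
print(f"len(tokenizer):    {len(tokenizer)}")     # 50265 for OPT-125M
if config.vocab_size > len(tokenizer):
    print("Mismatch: sampled token IDs above len(tokenizer) cannot be decoded.")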

How others solved it

THEN Align the model's vocab_size with the actual tokenizer vocabulary size. For OPT models, either modify vocab_size in config.json to match len(tokenizer), or change the Sampler initialization in vLLM's OPT code from Sampler(config.vocab_size) to Sampler(len(tokenizer)). For LLaMA/LLaMA-2, verify that vocab_size matches the tokenizer's actual size and adjust in the same way if it does not.

# In vLLM's OPT model (opt.py), change:
self.sampler = Sampler(config.vocab_size)
# to:
self.sampler = Sampler(len(tokenizer))
# Or adjust vocab_size in the model's config.json (e.g., for OPT-125M, from 50272 to 50265).
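
If you take the config.json route instead, a minimal sketch of the edit (the model directory is hypothetical; back up config.json first):

import json
from transformers import AutoTokenizer

model_dir = "/path/to/opt-125m"  # hypothetical local model directory
tokenizer = AutoTokenizer.from_pretrained(model_dir)

config_path = f"{model_dir}/config.json"
with open(config_path) as f:
    config = json.load(f)

config["vocab_size"] = len(tokenizer)  # e.g., 50272 -> 50265 for OPT-125M

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)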

