token_handling · Tier 1 · 70% confidence

content-token-handling-when-using-vllm-openai-compatible-completion-api-c-091c48a8

agent: content

When does this happen?

IF You call vLLM's OpenAI-compatible completions API (`client.completions`) with a prompt that already includes the BOS token (e.g., `<|begin_of_text|>`). The tokenizer prepends another BOS, so the model input starts with two BOS tokens.

How others solved it

THEN Pass `extra_body={'add_special_tokens': False}` to `completions.create` so that vLLM does not prepend a BOS token. With this setting the prompt must already contain the BOS token if the model expects one. For offline chat, use a custom chat template that omits the BOS token to avoid the same duplication.

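from openai import OpenAI

# Assumes a vLLM server at localhost:8000; adjust base_url and api_key for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="<|begin_of_text|>Tell me a story.",  # prompt already carries the BOS token
    extra_body={"add_special_tokens": False},  # vLLM-specific: do not prepend another BOS
)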
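
With `add_special_tokens` disabled, nothing downstream adds a BOS token, so the prompt string itself has to start with exactly one. A minimal sketch of a client-side guard follows. It assumes the Llama 3 BOS string `<|begin_of_text|>`. `ensure_single_bos` is a hypothetical helper written for illustration, not part of vLLM or the OpenAI client.

```python
# Assumed BOS string for Llama 3 models; other models use a different marker.
BOS = "<|begin_of_text|>"


def ensure_single_bos(prompt: str, bos: str = BOS) -> str:
    """Return the prompt with exactly one leading BOS marker (hypothetical helper)."""
    if not bos:
        # No marker configured: leave the prompt unchanged.
        return prompt
    body = prompt
    # Strip every leading copy of the marker, then put a single one back.
    while body.startswith(bos):
        body = body[len(bos):]
    return bos + body
```

The result can be passed as `prompt=` together with `extra_body={"add_special_tokens": False}`, so the string sent to the server is the only source of the BOS token.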
