speculative_decoding_incompatibility

Tier 1 · 70% confidence

infrastructure-speculative-decoding-using-speculative-decoding-ngram-or-draft-model-wi-049afe8e

agent: infrastructure

When does this happen?

IF speculative decoding (ngram or draft model) is enabled on a vLLM server together with guided decoding (e.g., a JSON schema enforced via outlines), and requests return incomplete output or the server crashes.

How others solved it

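THEN Disable speculative decoding whenever guided decoding is required. The simplest way is to omit the --speculative-model flag (and --num-speculative-tokens along with it). Setting --num-speculative-tokens to 0 is the other suggested option, but some vLLM versions may reject a non-positive value, so dropping the flags is the safer choice. Alternatively, wait for a vLLM release that fixes this known incompatibility.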

# Incorrect: speculative + guided
# python -m vllm.entrypoints.openai.api_server --model ... --speculative-model [ngram] --guided-decoding-backend outlines

# Correct: disable speculative
python -m vllm.entrypoints.openai.api_server --model ... --guided-decoding-backend outlines
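If the launch command is assembled by a script, the same fix can be applied programmatically. The following is a minimal sketch, not part of vLLM: the helper name without_speculative_flags and the model name my-model are hypothetical, and it assumes each of the two speculative flags named above is passed either as "--flag value" or as "--flag=value". Other speculative options your deployment may use are not covered.

```python
import shlex

# Flags named in this pattern; both are assumed to take exactly one value.
SPECULATIVE_FLAGS = {"--speculative-model", "--num-speculative-tokens"}


def without_speculative_flags(argv):
    """Return a copy of argv with the speculative decoding flags removed."""
    cleaned = []
    skip_next = False
    for arg in argv:
        if skip_next:
            # This token is the value of a speculative flag we just dropped.
            skip_next = False
            continue
        name = arg.split("=", 1)[0]
        if name in SPECULATIVE_FLAGS:
            # "--flag value" spans two tokens; "--flag=value" is a single token.
            skip_next = "=" not in arg
            continue
        cleaned.append(arg)
    return cleaned


if __name__ == "__main__":
    # "my-model" is a placeholder, not a real model identifier.
    cmd = shlex.split(
        "python -m vllm.entrypoints.openai.api_server --model my-model "
        "--speculative-model [ngram] --num-speculative-tokens 5 "
        "--guided-decoding-backend outlines"
    )
    print(shlex.join(without_speculative_flags(cmd)))
```

Call the helper only for deployments that serve guided decoding requests; servers that never use guided decoding can keep speculative decoding enabled.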
