cuda_oom_logprobs

Tier 1 · 70% confidence


agent: performance

When does this happen?

IF a prompt requests many `logprobs`, vLLM can hit a CUDA out-of-memory (OOM) error: the KV cache size calculated during warmup does not account for the activation memory that logprobs require.
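To get a feel for the scale involved, here is a rough back-of-envelope sketch. It assumes (this is not stated in the pattern itself) that the dominant extra activation is a logits-sized tensor of shape tokens × vocabulary in float32; the token counts and vocabulary size are illustrative values, not measurements.

```python
# Rough estimate only. Assumption: the extra activation is one float32
# logits tensor of shape (num_tokens, vocab_size). Real usage may differ.

BYTES_PER_FLOAT32 = 4


def logits_bytes(num_tokens: int, vocab_size: int) -> int:
    """Bytes needed for a float32 tensor of shape (num_tokens, vocab_size)."""
    return num_tokens * vocab_size * BYTES_PER_FLOAT32


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    vocab_size = 128_000   # illustrative vocabulary size
    full_prompt = 32_768   # illustrative: whole prompt prefilled at once
    one_chunk = 2_048      # illustrative: tokens handled per chunk

    print(f"whole prompt: {to_gib(logits_bytes(full_prompt, vocab_size)):.2f} GiB")
    print(f"one chunk:    {to_gib(logits_bytes(one_chunk, vocab_size)):.2f} GiB")
```

Under these assumptions the tensor grows linearly with the number of tokens processed in one step, so memory that was not reserved at warmup can be substantial for long prompts.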

How others solved it

THEN Enable chunked prefill by passing `--enable-chunked-prefill` to the vLLM server command. Processing the prompt in chunks lowers peak memory during the initial prefill, which avoids the OOM when many logprobs are requested. Alternatively, reduce the number of logprobs requested.
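The fix above can be sketched as a small launcher helper. This is a minimal illustration, not a definitive setup: the helper name `build_vllm_serve_command` and the model identifier are hypothetical, and it assumes the server is started through the `vllm serve` entry point. Only the `--enable-chunked-prefill` flag comes from the pattern itself.

```python
import shlex


def build_vllm_serve_command(model: str, chunked_prefill: bool = True) -> list[str]:
    """Build the argument list for launching a vLLM server.

    `model` is whatever model name or path you normally serve.
    This helper is hypothetical; it only assembles the command.
    """
    cmd = ["vllm", "serve", model]
    if chunked_prefill:
        # Flag named in the pattern: process the prompt in chunks
        # to lower peak memory during the initial prefill.
        cmd.append("--enable-chunked-prefill")
    return cmd


if __name__ == "__main__":
    # "my-org/my-model" is a placeholder, not a real model name.
    command = build_vllm_serve_command("my-org/my-model")
    print(shlex.join(command))
    # To actually launch the server, pass `command` to subprocess.run.
```

Building the command as a list rather than a string keeps the flag easy to toggle and avoids shell-quoting issues if it is later handed to `subprocess.run`.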
