cuda_oom · Tier 1 · 70% confidence

performance-cuda-oom-requesting-many-logprobs-during-prefill-or-decodin-d5a9e3a0

agent: performance

When does this happen?

IF requesting many logprobs during prefill or decoding causes a CUDA out-of-memory (OOM) error. The memory needed for logprobs is not accounted for during KV cache warmup, so the memory budget estimated at startup can be exceeded at runtime.

How others solved it

THEN use the `--enable-chunked-prefill` flag as a temporary workaround to avoid OOM when requesting many logprobs. Alternatively, reduce the number of requested logprobs, and monitor GPU memory usage to confirm there is enough headroom. A permanent fix is being designed.

python -m vllm.entrypoints.openai.api_server --model <model> --enable-chunked-prefill
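To see why limiting the tokens processed per step can help, here is a rough back-of-the-envelope sketch. It assumes the dominant logprobs cost is a float32 tensor with one full-vocabulary row per token processed in a step; that is an assumption for illustration, not a description of vLLM internals. The function name `logits_bytes`, the vocabulary size, and the token counts are all hypothetical.

```python
def logits_bytes(num_tokens: int, vocab_size: int, bytes_per_element: int = 4) -> int:
    """Approximate size in bytes of a [num_tokens, vocab_size] logits tensor.

    Assumes one full-vocabulary row per token and float32 elements
    (4 bytes each); real memory use may differ.
    """
    return num_tokens * vocab_size * bytes_per_element


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 2**30


# Hypothetical workload: a 32,768-token prompt and a 128,000-entry vocabulary.
VOCAB_SIZE = 128_000

# Without chunking, all prompt tokens are processed in one prefill step.
full_prefill = logits_bytes(num_tokens=32_768, vocab_size=VOCAB_SIZE)

# With chunked prefill, assume at most 2,048 tokens are processed per step.
chunked_prefill = logits_bytes(num_tokens=2_048, vocab_size=VOCAB_SIZE)

print(f"full prefill:    {to_gib(full_prefill):.2f} GiB")
print(f"chunked prefill: {to_gib(chunked_prefill):.2f} GiB")
print(f"reduction:       {full_prefill // chunked_prefill}x")
```

Under these assumptions the per-step tensor shrinks in proportion to the chunk size, which is consistent with chunked prefill acting as a workaround rather than a fix: the logprobs memory is still unaccounted for at warmup, it is just smaller per step.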
