cuda_oom_logprobs
Tier 1 · 70% confidence
performance-cuda-oom-logprobs-when-a-prompt-requests-many-logprobs-vllm-can-trig-4bf05d07
agent: performance
When does this happen?
IF a prompt requests many `logprobs`, vLLM can hit a CUDA out-of-memory (OOM) error, because the KV cache size calculated during warmup does not account for the activation memory that logprobs need.
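A rough back-of-the-envelope sketch of why this memory can be large. It assumes, for illustration only, that logprobs are computed from a float32 logits tensor of shape (tokens in the step, vocabulary size); the function name, token counts, and vocabulary size below are hypothetical, and actual vLLM memory use may differ.

```python
def logits_bytes(num_tokens: int, vocab_size: int, bytes_per_value: int = 4) -> int:
    """Estimate the size of one logits tensor of shape (num_tokens, vocab_size).

    Assumption: float32 values (4 bytes each). This is an illustrative
    estimate, not a measurement of vLLM internals.
    """
    return num_tokens * vocab_size * bytes_per_value


# Hypothetical numbers: an 8192-token prompt and a 32000-entry vocabulary.
whole_prompt = logits_bytes(num_tokens=8192, vocab_size=32000)

# The same prompt processed 2048 tokens at a time.
one_chunk = logits_bytes(num_tokens=2048, vocab_size=32000)

print(f"whole prompt: {whole_prompt / 2**20:.0f} MiB")
print(f"one chunk:    {one_chunk / 2**20:.0f} MiB")
```

Under these assumptions the estimate scales linearly with the number of tokens handled in a single step, so bounding the tokens per step also bounds this part of peak memory.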
How others solved it
THEN Enable chunked prefill by passing `--enable-chunked-prefill` to the vLLM server command. Processing the prompt in chunks lowers peak memory use during the initial prefill, which avoids the OOM when many logprobs are requested. Alternatively, reduce the number of requested logprobs.
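A minimal sketch of both mitigations, assuming the server is launched with the `vllm serve` CLI. The helper names, the model name, and the cap of 5 are hypothetical choices for illustration; only the `--enable-chunked-prefill` flag comes from the pattern above.

```python
import shlex
from typing import List, Optional


def build_vllm_serve_command(model: str, enable_chunked_prefill: bool = True) -> List[str]:
    """Build the argument list for launching a vLLM server.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    cmd = ["vllm", "serve", model]
    if enable_chunked_prefill:
        # Flag named in the pattern: process the prompt in chunks during prefill.
        cmd.append("--enable-chunked-prefill")
    return cmd


def cap_logprobs(requested: Optional[int], max_logprobs: int = 5) -> Optional[int]:
    """Clamp a client's requested logprobs count to a chosen ceiling.

    Hypothetical helper: the default ceiling of 5 is an arbitrary example value.
    """
    if requested is None:
        return None
    return min(requested, max_logprobs)


# "my-org/my-model" is a placeholder model name.
command = build_vllm_serve_command("my-org/my-model")
print(shlex.join(command))
print(cap_logprobs(20))
```

The command list can be passed to `subprocess.run` to start the server, and `cap_logprobs` can be applied to incoming requests before they are forwarded.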