cpu_memory_growth · Tier 1 · 70% confidence

infrastructure-cpu-memory-growth-cpu-memory-continuously-increases-during-inference-f2750638

agent: infrastructure

When does this happen?

IF CPU memory continuously increases during inference with prefix caching enabled, eventually exhausting RAM and crashing the server.
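Before changing flags, it helps to confirm that the server's resident memory really is climbing rather than just plateauing at a high level. The sketch below makes several assumptions. It runs on a Linux host, because it reads `/proc/<pid>/status`. You know the PID of the vLLM server process. The helper names and the 80% "rising steps" heuristic are illustrative choices, not part of vLLM. vLLM may also run work in child processes, so sample those PIDs as well.

```python
import time
from pathlib import Path


def rss_kib(pid: int) -> int:
    """Return the resident set size of `pid` in KiB.

    Assumption: Linux /proc filesystem; the line looks like "VmRSS:\t  123456 kB".
    """
    for line in Path(f"/proc/{pid}/status").read_text().splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    raise RuntimeError(f"VmRSS not found for pid {pid}")


def sample_rss(pid: int, count: int, interval_s: float) -> list[int]:
    """Take `count` RSS readings of `pid`, `interval_s` seconds apart."""
    samples = []
    for i in range(count):
        samples.append(rss_kib(pid))
        if i < count - 1:
            time.sleep(interval_s)
    return samples


def is_steadily_growing(
    samples: list[int],
    min_total_growth_kib: int,
    min_rising_fraction: float = 0.8,
) -> bool:
    """Crude heuristic: memory grew by at least the threshold overall, and most
    consecutive readings did not go down (RSS normally jitters a little)."""
    if len(samples) < 2:
        return False
    steps = list(zip(samples, samples[1:]))
    rising = sum(1 for before, after in steps if after >= before)
    grew_enough = samples[-1] - samples[0] >= min_total_growth_kib
    return grew_enough and rising / len(steps) >= min_rising_fraction
```

For example, `is_steadily_growing(sample_rss(server_pid, count=30, interval_s=60), min_total_growth_kib=1_048_576)` flags roughly 1 GiB or more of mostly monotonic growth over about half an hour of inference, where `server_pid` is a placeholder for your process ID.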

How others solved it

THEN disable prefix caching by adding `--no-enable-prefix-caching` to your `vllm serve` command. For multimodal models, you can also (or instead) add `--disable-mm-preprocessor-cache`. In the reported cases this stopped the unbounded CPU memory growth. It may reduce inference throughput, however, so measure the trade-off for your workload.

vllm serve ./model --no-enable-prefix-caching --disable-mm-preprocessor-cache
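One way to measure that trade-off is to replay the same prompts against the server once with the default flags and once with the flags above, then compare requests per second. The following is a minimal sketch using only the standard library. It assumes the server listens on vLLM's default port 8000, exposes the OpenAI-compatible `/v1/completions` endpoint, and serves the model under the name `./model` (the path passed to `vllm serve`). `BASE_URL`, `MODEL`, and the helper names are illustrative. Requests are sent one at a time, so the result reflects sequential latency rather than peak batched throughput. vLLM's own benchmarking tools give a fuller picture.

```python
import json
import time
import urllib.request

# Assumptions: default port 8000 and a served model name equal to the path
# given to `vllm serve`; adjust both to match your deployment.
BASE_URL = "http://localhost:8000/v1/completions"
MODEL = "./model"


def time_requests(prompts: list[str], max_tokens: int = 32) -> float:
    """Send prompts one at a time and return completed requests per second."""
    start = time.perf_counter()
    for prompt in prompts:
        body = json.dumps(
            {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
        ).encode()
        req = urllib.request.Request(
            BASE_URL, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            resp.read()  # wait for the full completion before timing the next one
    return len(prompts) / (time.perf_counter() - start)


def throughput_change_pct(baseline_rps: float, candidate_rps: float) -> float:
    """Percent change from the baseline; negative means the candidate is slower."""
    return (candidate_rps - baseline_rps) / baseline_rps * 100.0
```

Prompts that share a long common prefix (for example, the same system prompt followed by different questions) are where prefix caching helps most. Including them shows the worst-case cost of turning it off.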
