cpu_memory_growth · Tier 1 · 70% confidence
infrastructure-cpu-memory-growth-cpu-memory-continuously-increases-during-inference-f2750638
agent: infrastructure
When does this happen?
IF host (CPU) memory grows continuously during inference while prefix caching is enabled, eventually exhausting RAM and crashing the server.
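To confirm you are hitting this pattern rather than normal warm-up, you can sample the server's resident memory over time. This is a minimal sketch, assuming Linux (it reads `VmRSS` from `/proc/<pid>/status`) and that you know the PID of the `vllm serve` process. The function names and the 1 GiB threshold are illustrative, not part of vLLM, and the "never decreased" check is a deliberately strict heuristic. Depending on version and configuration, vLLM may do its work in child processes, so you may need to watch those PIDs as well.

```python
import time


def read_rss_kb(pid="self"):
    """Return VmRSS (resident memory) of a process in KiB. Linux only.

    `pid` defaults to "self", i.e. the calling process.
    """
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                # Line looks like "VmRSS:   123456 kB"
                return int(line.split()[1])
    raise RuntimeError(f"no VmRSS entry for pid {pid}")


def watch(pid, interval_s=60.0, n_samples=30):
    """Collect n_samples RSS readings of `pid`, interval_s seconds apart."""
    samples = []
    for i in range(n_samples):
        samples.append(read_rss_kb(pid))
        if i < n_samples - 1:  # no need to sleep after the last reading
            time.sleep(interval_s)
    return samples


def is_continuously_growing(samples_kb, min_total_growth_kb=1024 * 1024):
    """True if RSS never decreased between samples and grew by at least
    min_total_growth_kb overall (default threshold: 1 GiB, illustrative)."""
    if len(samples_kb) < 2:
        return False
    never_decreased = all(b >= a for a, b in zip(samples_kb, samples_kb[1:]))
    total_growth = samples_kb[-1] - samples_kb[0]
    return never_decreased and total_growth >= min_total_growth_kb
```

For example, `is_continuously_growing(watch(pid, interval_s=60, n_samples=30))` samples for roughly half an hour while the server handles its normal traffic.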
How others solved it
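THEN disable prefix caching by adding `--no-enable-prefix-caching` to your `vllm serve` command. For multimodal models, also pass `--disable-mm-preprocessor-cache`, either in addition to or instead of disabling prefix caching. In reported cases this stopped the unbounded CPU memory growth. It may reduce inference throughput, because prefix caching lets requests that share a prompt prefix reuse cached KV blocks instead of recomputing them, so measure the trade-off on your own workload.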
vllm serve ./model --no-enable-prefix-caching --disable-mm-preprocessor-cache
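Because the fix can cost throughput, it helps to compare a baseline run (default settings, prefix caching on) with a run using the flags above, under the same load. The sketch below is a minimal, hypothetical helper: `RunStats`, `rss_growth_mb_per_hour` and `compare` are illustrative names, and it assumes you already collect timestamped RSS samples and output-token counts from your own load test. It fits a least-squares line to RSS over time to get a growth rate in MiB/hour, which is less sensitive to noise than comparing only the first and last samples.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Measurements from one load-test run (hypothetical structure)."""
    times_s: list        # sample timestamps, in seconds
    rss_kb: list         # RSS at each timestamp, in KiB
    output_tokens: int   # tokens generated during the run
    duration_s: float    # wall-clock duration of the run, in seconds


def rss_growth_mb_per_hour(times_s, rss_kb):
    """Least-squares slope of RSS over time, converted to MiB/hour.

    Needs at least two distinct timestamps.
    """
    n = len(times_s)
    mean_t = sum(times_s) / n
    mean_r = sum(rss_kb) / n
    cov = sum((t - mean_t) * (r - mean_r) for t, r in zip(times_s, rss_kb))
    var = sum((t - mean_t) ** 2 for t in times_s)
    slope_kb_per_s = cov / var
    return slope_kb_per_s * 3600 / 1024


def compare(baseline, candidate):
    """Return (throughput change in %, baseline growth, candidate growth).

    Growth rates are in MiB/hour; a negative change means the candidate
    run produced fewer tokens per second than the baseline.
    """
    tput_base = baseline.output_tokens / baseline.duration_s
    tput_cand = candidate.output_tokens / candidate.duration_s
    change_pct = (tput_cand - tput_base) / tput_base * 100
    return (
        change_pct,
        rss_growth_mb_per_hour(baseline.times_s, baseline.rss_kb),
        rss_growth_mb_per_hour(candidate.times_s, candidate.rss_kb),
    )
```

A near-zero growth rate for the candidate run together with a small throughput change would suggest the flags are an acceptable trade-off for that workload.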
Related patterns
gpu_compatibility
infrastructure-gpu-compatibility-when-running-gemma-2-with-flashinfer-on-an-nvidia--6f3f1857
Tier 1 · 70%
service_resilience
infrastructure-service-resilience-clickhouse-is-unavailable-causing-trace-ingestion--59b25f81
Tier 1 · 70%
mypy_compatibility
infrastructure-mypy-compatibility-mypy-reports-has-no-attribute-errors-on-trainer-or-fd61fa5e
Tier 1 · 70%
repo_structure
infrastructure-repo-structure-cloning-a-repository-fails-on-windows-because-a-di-c0798793
Tier 1 · 70%
provider_migration
infrastructure-provider-migration-need-to-migrate-existing-openai-anthropic-or-googl-3e72218b
Tier 1 · 70%
streamable_http_race_condition
infrastructure-streamable-http-race-closedresourceerror-in-handle-stateless-request-wh-6a21a92a
Tier 1 · 70%