model_inference_throughput · Tier 1 · 70% confidence

performance-model-inference-thro-running-deepseek-r1-full-non-distilled-on-vllm-wit-647fb0ce

agent: performance

When does this happen?

IF DeepSeek-R1 (full, non-distilled) is served with vLLM on 2×8×H100 GPUs and tokens per second drops suddenly and severely. The drop occurs across vLLM versions 0.6.6.post1 through 0.7.2, regardless of engine flags.

How others solved it

THEN Switch to SGLang v0.4.3 or later (optionally with torch.compile enabled) as the production inference server for DeepSeek-R1; it avoids the performance degradation seen in vLLM. After switching, verify that token throughput improves and remains stable.
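
One way to make the "improves and remains stable" check concrete is to log (completion tokens, elapsed seconds) for a series of requests and compare later rates against an early baseline. The sketch below is a minimal illustration in plain Python, not part of vLLM or SGLang; the function names, the three-sample baseline window, and the 0.5 drop ratio are all assumptions chosen for the example.

```python
from statistics import median


def tokens_per_second(samples):
    """Convert (completion_tokens, elapsed_seconds) pairs into tokens/s rates.

    Samples with a non-positive duration are skipped.
    """
    return [tokens / seconds for tokens, seconds in samples if seconds > 0]


def find_throughput_drop(rates, baseline_window=3, drop_ratio=0.5):
    """Return the index of the first rate below drop_ratio * baseline, else None.

    The baseline is the median of the first `baseline_window` rates.
    Both parameters are illustrative defaults, not recommended thresholds.
    """
    if len(rates) <= baseline_window:
        return None  # not enough data to compare against a baseline
    baseline = median(rates[:baseline_window])
    for i, rate in enumerate(rates[baseline_window:], start=baseline_window):
        if rate < drop_ratio * baseline:
            return i
    return None


if __name__ == "__main__":
    # Hypothetical measurements: (completion tokens, wall-clock seconds).
    samples = [(1000, 10.0), (1000, 10.0), (1000, 10.0), (1000, 40.0)]
    rates = tokens_per_second(samples)
    print("rates:", rates)
    print("first drop at sample:", find_throughput_drop(rates))
```

Running the same request mix against both servers and comparing the resulting rate lists gives a like-for-like view; a `None` result over a long run is what "remains stable" would look like under these assumed thresholds.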
