inference_determinism · Tier 1 · 70% confidence

infrastructure-inference-determinis-batched-inference-with-vllm-using-float16-precisio-af831e05

agent: infrastructure

When does this happen?

IF batched inference with vLLM in float16 precision returns different responses for the same prompt once the batch size exceeds 1, even with temperature=0 and a fixed seed.
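One way to confirm the symptom is to generate the same prompt once on its own and once alongside other prompts, then compare the two completions. The sketch below assumes a hypothetical `generate` adapter (not part of vLLM) that takes a list of prompts and returns completion strings in the same order. The commented wiring shows how it might be connected to vLLM's `LLM.generate` and `SamplingParams`, assuming vLLM is installed and a GPU is available.

```python
def batch_outputs_match(generate, prompt, filler_prompts):
    """Return True if `prompt` produces the same text alone and inside a batch.

    `generate` is a hypothetical adapter: it takes a list of prompts and
    returns a list of completion strings in the same order.
    """
    alone = generate([prompt])[0]
    batched = generate([prompt] + list(filler_prompts))[0]
    return alone == batched


# Possible vLLM wiring (assumption: vLLM installed, GPU available):
#
#   from vllm import LLM, SamplingParams
#   llm = LLM(model='meta-llama/Llama-2-7b-hf', dtype='float16')
#   params = SamplingParams(temperature=0, seed=0, max_tokens=64)
#   generate = lambda prompts: [
#       o.outputs[0].text for o in llm.generate(prompts, params)
#   ]
#   print(batch_outputs_match(generate, 'Explain float16.', ['Hi', 'Tell me a joke']))
```

A `False` result for a prompt that should be deterministic suggests the batch composition is influencing the output.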

How others solved it

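THEN Switch to float32 precision by adding `--dtype float32` to the vLLM server launch command, or by setting `dtype='float32'` when initializing the `LLM` class. Float32 roughly doubles the memory needed for model weights compared with float16, but its smaller rounding error makes it much less likely that a change in accumulation order will flip the selected token. Alternatively, set `max_num_seqs=1` to force single-request processing. This avoids the order-dependent floating-point accumulation that occurs when several sequences share a batch, at the cost of throughput.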

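# Server launch with float32
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf --dtype float32

# Or in a Python script
from vllm import LLM
llm = LLM(model='meta-llama/Llama-2-7b-hf', dtype='float32')

# Alternative: keep the default precision but process one sequence at a time
llm = LLM(model='meta-llama/Llama-2-7b-hf', max_num_seqs=1)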
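The order-dependent accumulation mentioned above can be illustrated without a GPU. The following is a simplified CPU simulation, not vLLM's actual kernels: it uses Python's `struct` module to round every partial sum to float16 (`'e'`) or float32 (`'f'`) and adds the same numbers in two different orders. The helper names and the input values are chosen purely for illustration.

```python
import struct


def round_to(fmt, x):
    """Round a Python float to an IEEE 754 format ('e' = float16, 'f' = float32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def accumulate(values, fmt):
    """Sum values left to right, rounding to the target precision after each add."""
    total = 0.0
    for v in values:
        total = round_to(fmt, total + round_to(fmt, v))
    return total


# Illustrative values: one large term followed by several small ones.
values = [4096.0, 1.0, 1.0, 1.0, 1.0]

forward_fp16 = accumulate(values, 'e')
reverse_fp16 = accumulate(reversed(values), 'e')
forward_fp32 = accumulate(values, 'f')
reverse_fp32 = accumulate(reversed(values), 'f')

print('float16:', forward_fp16, reverse_fp16)  # the two orders disagree
print('float32:', forward_fp32, reverse_fp32)  # the two orders agree
```

In float16 the spacing between representable numbers near 4096 is 4, so each `+ 1.0` is rounded away when the large term comes first. Float32 has enough precision to keep every term in this example, which is why the two orders agree there. Real inference kernels accumulate far more terms, so float32 narrows the gap rather than guaranteeing bit-identical results.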
