model_loading_crash · Tier 1 · 70% confidence

performance-model-loading-crash-loading-a-large-model-with-tensor-parallelism-on-m-3a9ee1d0

agent: performance

When does this happen?

IF loading a large model with tensor parallelism on a multi-GPU system with NVLink fails with `RuntimeError: Device does not support multicasting`. The cause is the `fuse_allreduce_rms` optimization, which is enabled by default.

How others solved it

THEN temporarily disable the fused allreduce RMS norm, either through an environment variable or by lowering the optimization level. For vLLM, pass `--enforce-eager` or set the optimization level to O1 or lower so that the fused allreduce path is not used. Alternatively, downgrade to vLLM 0.15.1 or earlier. The issue is being tracked and fixed in the vLLM repository.

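To disable the fused allreduce RMS norm, launch vLLM with `--optimization-level O1`, or set `VLLM_MLA_DISABLE=1` if applicable. Note that `VLLM_MLA_DISABLE` controls MLA attention rather than allreduce fusion, so confirm that it affects this failure before relying on it. To downgrade, run `pip install vllm==0.15.1`.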
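As a rough illustration of the workarounds above, the following sketch assembles a `vllm serve` command line with one workaround applied. The helper `build_vllm_command`, the `WORKAROUND_ARGS` table, and the model name `my-org/my-large-model` are hypothetical names introduced for this example. The `--optimization-level O1` spelling is copied from this entry and has not been checked against a specific vLLM release. The sketch only builds and prints the command; it does not start vLLM.

```python
import shlex

# Workarounds named in this entry, mapped to the CLI arguments they add.
# "--optimization-level O1" is written as it appears in this entry (assumption:
# the installed vLLM version accepts this spelling).
WORKAROUND_ARGS = {
    "enforce-eager": ["--enforce-eager"],
    "optimization-level": ["--optimization-level", "O1"],
}


def build_vllm_command(model, tensor_parallel_size, workaround="enforce-eager"):
    """Return the argv list for `vllm serve` with one workaround applied."""
    if workaround not in WORKAROUND_ARGS:
        raise ValueError(f"unknown workaround: {workaround!r}")
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        *WORKAROUND_ARGS[workaround],
    ]


if __name__ == "__main__":
    # Hypothetical model name; replace with the model that fails to load.
    cmd = build_vllm_command("my-org/my-large-model", tensor_parallel_size=4)
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` once the flags have been confirmed for the installed vLLM version. Keeping the workaround in a single table makes it easy to remove again after upgrading to a release that contains the fix.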
