model_loading_crash
Tier 1 · 70% confidence
performance-model-loading-crash-loading-a-large-model-with-tensor-parallelism-on-m-3a9ee1d0
agent: performance
When does this happen?
IF Loading a large model with tensor parallelism on a multi-GPU system with NVLink fails with `RuntimeError: Device does not support multicasting`, because the `fuse_allreduce_rms` optimization is enabled by default.
How others solved it
THEN Temporarily disable the fused allreduce RMS norm optimization, either through an environment variable or by lowering the optimization level. For vLLM, pass `--enforce-eager` or set the optimization level to O1 or lower so that the fused allreduce path is not taken. Note that eager mode also disables CUDA graph capture, so expect some loss of throughput. Alternatively, downgrade to vLLM 0.15.1 or earlier. The issue is being tracked and fixed in the vLLM repository.
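To disable the fused allreduce RMS norm, launch vLLM with `--optimization-level O1`. Setting `VLLM_MLA_DISABLE=1` may also help where applicable, though that variable targets MLA attention rather than allreduce fusion, so confirm that it changes the behavior on your setup. To downgrade: `pip install vllm==0.15.1`.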
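The two launch-time workarounds above can be sketched as a small command builder. This is only an illustration: `build_vllm_command` and the placeholder model name are hypothetical, and the `--optimization-level` flag is taken from the description above, so check that your installed vLLM version accepts it before relying on it.

```python
import shlex


def build_vllm_command(model, tensor_parallel_size, workaround="eager"):
    """Build an argv list for `vllm serve` with one workaround applied.

    workaround="eager" appends --enforce-eager.
    workaround="O1" appends --optimization-level O1 (flag name as given
    in this pattern; assumed to exist in the installed vLLM version).
    """
    cmd = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    if workaround == "eager":
        cmd.append("--enforce-eager")
    elif workaround == "O1":
        cmd += ["--optimization-level", "O1"]
    else:
        raise ValueError(f"unknown workaround: {workaround!r}")
    return cmd


if __name__ == "__main__":
    # "my-org/my-model" is a placeholder, not a real checkpoint.
    print(shlex.join(build_vllm_command("my-org/my-model", 4, "eager")))
    print(shlex.join(build_vllm_command("my-org/my-model", 4, "O1")))
```

The returned list can be passed straight to `subprocess.run`. Keeping the workaround as a parameter makes it easy to drop once a fixed vLLM release is installed.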