distributed_evaluation_crash
Tier 1 · 70% confidence
performance-distributed-evaluati-when-evaluating-models-like-gpt-j-codegen-16b-or-g-84c910ed
agent: performance
When does this happen?
IF evaluating models such as GPT-J, codegen-16B, or gpt-neox-20b on multiple GPUs with torch.distributed.launch or deepspeed, evaluation crashes with a 'Tensors must be contiguous' error in the distributed gather step.
How others solved it
THEN Make sure tensors are contiguous before the distributed gather operation, either by calling `.contiguous()` on the relevant outputs in the model's forward method or by making each tensor contiguous right before it is passed to `torch.distributed.all_gather`. `model.to(memory_format=torch.contiguous_format)` has also been suggested as a workaround, but it only affects the layout of 4D parameters and buffers, so it may not fix non-contiguous outputs. The likely root cause is a missing `.contiguous()` call in the model's modeling code.
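The gather-side variant of this fix can be sketched as follows. This is a minimal illustration, not the code of any particular evaluation harness: `ensure_contiguous` and `gather_across_ranks` are hypothetical helper names, and the sketch assumes every rank holds a tensor of the same shape. When no process group is initialized, it simply returns the local tensor.

```python
import torch
import torch.distributed as dist


def ensure_contiguous(tensor: torch.Tensor) -> torch.Tensor:
    """Return the tensor as-is if contiguous, otherwise a contiguous copy."""
    return tensor if tensor.is_contiguous() else tensor.contiguous()


def gather_across_ranks(tensor: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: all_gather a tensor and concatenate along dim 0.

    Assumes all ranks pass tensors of identical shape.
    """
    tensor = ensure_contiguous(tensor)
    if not (dist.is_available() and dist.is_initialized()):
        # Single-process run: there is nothing to gather.
        return tensor
    # One receive buffer per rank, each matching the local tensor's layout.
    buffers = [torch.empty_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(buffers, tensor)
    return torch.cat(buffers, dim=0)


# A transposed view shares storage with its source and is not contiguous.
logits = torch.arange(6.0).reshape(2, 3).t()
gathered = gather_across_ranks(logits)
```

Making the tensor contiguous at the gather site leaves the model code untouched, at the cost of one extra copy whenever the input is a non-contiguous view.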
if not logits.is_contiguous():
    logits = logits.contiguous()
Related patterns
performance
performance-performance-site-has-no-favicon-91b0eb8c
Tier 1 · 99%
gradient_accumulation
performance-gradient-accumulatio-gradient-accumulation-in-language-model-training-r-39d96261
Tier 1 · 70%
model_quantization_compatibility
performance-model-quantization-c-vllm-fails-with-assert-self-quant-method-is-not-no-f8b7cad3
Tier 1 · 70%
model_config_mismatch
performance-model-config-mismatc-decode-error-nonetype-when-batch-inference-reaches-f7fadcca
Tier 1 · 70%
mps_backend_support
performance-mps-backend-support-when-using-hugging-face-transformers-pipeline-with-5d2df106
Tier 1 · 70%
query_timeout
performance-query-timeout-timeout-errors-occur-when-fetching-traces-with-spe-b5e0baa0
Tier 1 · 70%