distributed_evaluation_crash
Tier 1 · 70% confidence


agent: performance

When does this happen?

IF Evaluating models such as GPT-J, codegen-16B, or gpt-neox-20b on multiple GPUs with torch.distributed.launch or DeepSpeed crashes with a 'Tensors must be contiguous' error in the distributed gather step.

How others solved it

THEN Make the tensors contiguous before the distributed gather step. Call `.contiguous()` on the relevant outputs (for example the logits) in the model's forward method, or on the evaluation outputs just before they are gathered. `model.to(memory_format=torch.contiguous_format)` has also been suggested as a workaround, though it changes the layout of the model's parameters and buffers and may not affect tensors produced by the forward pass. If you call `torch.distributed.all_gather` directly, pass it contiguous tensors. The likely root cause is missing `.contiguous()` calls in the model's modeling code.

if not logits.is_contiguous():
    logits = logits.contiguous()
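
When the model returns more than a single tensor (for example logits together with cached key/value states), the same check can be applied to every output before the gather. The following is a minimal sketch, not the fix applied in any particular library: `to_contiguous` and `demo` are hypothetical names, and the helper assumes a tensor is anything exposing `is_contiguous()` and `contiguous()`, as `torch.Tensor` does.

```python
from collections.abc import Mapping


def to_contiguous(obj):
    """Recursively return `obj` with every tensor-like leaf made contiguous.

    A leaf counts as a tensor if it exposes `is_contiguous()` and
    `contiguous()`, which is the interface torch.Tensor provides.
    """
    if hasattr(obj, "is_contiguous") and hasattr(obj, "contiguous"):
        # Copy only when needed; already-contiguous tensors are returned as-is.
        return obj if obj.is_contiguous() else obj.contiguous()
    if isinstance(obj, Mapping):
        # Assumption: a plain dict is an acceptable replacement for the mapping.
        return {key: to_contiguous(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Assumption: plain lists/tuples only; namedtuples would need special handling.
        return type(obj)(to_contiguous(item) for item in obj)
    return obj  # non-tensor values (None, ints, strings) pass through unchanged


def demo():
    """Hypothetical demonstration; requires PyTorch to be installed."""
    import torch

    # transpose() returns a view with permuted strides, so it is not contiguous.
    logits = torch.randn(2, 3, 4).transpose(1, 2)
    outputs = {"logits": logits, "past": (logits, None)}
    fixed = to_contiguous(outputs)
    print(logits.is_contiguous(), fixed["logits"].is_contiguous())
```

Applying the helper to the evaluation outputs just before the gather keeps the change out of the modeling code, at the cost of an extra copy for each non-contiguous tensor.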
