deepspeed_zero3_pretrained_loading

Tier 1 · 70% confidence

infrastructure-deepspeed-zero3-pret-when-loading-a-pretrained-model-with-deepspeed-zer-aa36b311

agent: infrastructure

When does this happen?

IF a pretrained model is loaded with DeepSpeed ZeRO Stage 3 and `ignore_mismatched_sizes=True` is passed to `from_pretrained`, the weights may not be initialized properly, which shows up as shape mismatch warnings and poor model performance.
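To see which tensors are behind the shape mismatch warnings, it can help to compare the checkpoint's shapes with the shapes the target model expects. The following is a minimal sketch only: `find_shape_mismatches` is a hypothetical helper, the parameter names are illustrative, and plain tuples stand in for tensor shapes. Under ZeRO-3 the model-side shapes would have to be the full shapes, since each rank only holds a partition of every parameter.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for keys whose shapes differ."""
    mismatches = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        # Keys missing from the model are skipped; only real shape conflicts are reported.
        if model_shape is not None and tuple(model_shape) != tuple(ckpt_shape):
            mismatches[name] = (tuple(ckpt_shape), tuple(model_shape))
    return mismatches


# Illustrative shapes: a 2-class checkpoint compared with a 5-class model.
checkpoint_shapes = {
    "encoder.layer.0.weight": (768, 768),
    "classifier.weight": (2, 768),
    "classifier.bias": (2,),
}
model_shapes = {
    "encoder.layer.0.weight": (768, 768),
    "classifier.weight": (5, 768),
    "classifier.bias": (5,),
}
mismatches = find_shape_mismatches(checkpoint_shapes, model_shapes)
print(sorted(mismatches))
```

With real objects, the shape dictionaries could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`. If only the classifier entries differ, replacing the head after loading is usually enough.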

How others solved it

THEN with DeepSpeed ZeRO-3, avoid passing `ignore_mismatched_sizes=True` directly. Instead, load the model without that flag (which triggers automatic weight consolidation), then adjust the classifier head manually after loading, e.g. replace the output layer with one sized for the desired number of classes. Alternatively, use `deepspeed.zero.GatheredParameters` to gather the sharded weights before modifying them.

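import torch
from transformers import AutoModelForSequenceClassification

# Instead of:
# model = AutoModelForSequenceClassification.from_pretrained(model_name, ignore_mismatched_sizes=True)
# Do (model_name and num_labels are supplied by the caller):
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Then replace the classifier head manually if the output size differs.
# The head attribute name varies by architecture (`classifier` on BERT-style models).
model.classifier = torch.nn.Linear(model.config.hidden_size, num_labels)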
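The `deepspeed.zero.GatheredParameters` route mentioned above could look roughly like the sketch below. It rests on several assumptions: `resize_classifier` is a hypothetical helper, the head is a single `torch.nn.Linear` with a bias stored as `model.classifier`, and copying the overlapping rows of the old head is only useful when the first classes carry over to the new task. It has not been validated on a multi-GPU ZeRO-3 run.

```python
import contextlib

import torch


def resize_classifier(model, num_labels, zero3=False):
    """Swap model.classifier for a Linear with num_labels outputs (hypothetical helper)."""
    old_head = model.classifier
    new_head = torch.nn.Linear(old_head.in_features, num_labels)

    if zero3:
        # Optional dependency, imported lazily so the sketch also runs without it.
        import deepspeed

        # modifier_rank=None: the gathered weights are only read, not modified.
        gather = deepspeed.zero.GatheredParameters(
            list(old_head.parameters()), modifier_rank=None
        )
    else:
        gather = contextlib.nullcontext()

    # Inside the context the old head's full (unsharded) weights are readable.
    with gather, torch.no_grad():
        rows = min(old_head.out_features, num_labels)
        new_head.weight[:rows].copy_(old_head.weight[:rows])
        new_head.bias[:rows].copy_(old_head.bias[:rows])

    model.classifier = new_head
    # Some architectures read these values when computing the loss.
    model.num_labels = num_labels
    model.config.num_labels = num_labels
    return model
```

A call such as `model = resize_classifier(model, num_labels, zero3=True)` would go after `from_pretrained` and before the model is handed to the DeepSpeed engine or trainer. If none of the old classes carry over, the row copy can simply be dropped and the new head left at its default initialization.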
