wandb_training_resume

Tier 1 · 70% confidence

observability-wandb-training-resum-resuming-lora-training-with-wandb-enabled-fails-wi-939beb1f

agent: observability

When does this happen?

IF resuming LoRA training with wandb enabled fails with a ConfigError because model/num_parameters changes from 0 to the actual parameter count.

How others solved it

THEN Upgrade transformers to a version that includes PR #33464, which fixes wandb config handling on resume. As a workaround, set the environment variables WANDB_RESUME=allow and WANDB_RUN_ID (the ID of the run to resume), then monkey-patch the wandb callback so that it skips setting model/num_parameters when the key is already present in the config.

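from transformers.integrations import WandbCallback

# Keep a reference to the original method so the patch can delegate to it.
original_setup = WandbCallback.setup

def patched_setup(self, args, state, model, **kwargs):
    # Assumes a wandb run is already active, so its config can be inspected.
    # Note: when the key is present, the whole setup call is skipped, not
    # only the model/num_parameters assignment.
    if 'model/num_parameters' not in self._wandb.config:
        original_setup(self, args, state, model, **kwargs)

# Apply the patch before training starts.
WandbCallback.setup = patched_setup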
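
The environment-variable half of the workaround can be sketched as follows. This is a minimal illustration, not part of the original fix: the helper name configure_wandb_resume and the run ID "abc123xy" are hypothetical placeholders, and the real run ID must be taken from the run being resumed. The variables have to be set before wandb initializes the run, so in practice before training starts.

```python
import os


def configure_wandb_resume(run_id: str) -> dict:
    """Set the wandb resume variables and return what was set.

    Hypothetical helper for illustration; only the two variable
    names come from the workaround described above.
    """
    settings = {
        "WANDB_RESUME": "allow",  # resume the run if the ID exists
        "WANDB_RUN_ID": run_id,   # ID of the run to resume
    }
    os.environ.update(settings)
    return settings


# "abc123xy" is a placeholder; substitute the ID of the interrupted run.
applied = configure_wandb_resume("abc123xy")
print(applied)
```

Setting the variables from Python keeps the resume settings next to the training script. Exporting them in the shell before launching the script should have the same effect.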
