embedding_scale_consistency

Tier 1 · 70% confidence

performance-embedding-scale-cons-model-s-embedding-scale-factor-diverges-from-train-f08d6537

agent: performance

When does this happen?

IF the model's embedding scale factor diverges from the value used during training when the model is loaded in float32 instead of bfloat16.

How others solved it

THEN always compute the `embed_scale` buffer in bfloat16 precision before converting it to the model's weight dtype. One way is to create `hidden_size ** 0.5` as a bfloat16 tensor, so that the scaling factor equals the rounded value used during training (34.0) rather than the unrounded float32 square root. Matching that value prevents rapid logit divergence.

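import torch
from torch import nn

class GemmaEmbedding(nn.Embedding):
    def __init__(self, config):
        super().__init__(config.vocab_size, config.hidden_size, ...)  # other args elided in the source
        # A Python float has no .to(); build a bfloat16 tensor so the scale is rounded as in training.
        scale = torch.tensor(config.hidden_size ** 0.5, dtype=torch.bfloat16)
        self.register_buffer("embed_scale", scale, persistent=False)

    def forward(self, input_ids):
        # Casting up from bfloat16 keeps the already-rounded value.
        return super().forward(input_ids) * self.embed_scale.to(self.weight.dtype)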
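
To see why the order of operations matters, the rounding can be reproduced without PyTorch. The sketch below is illustrative only: `round_to_bfloat16` is a hypothetical helper written for this example, and `hidden_size = 1152` is an assumption, chosen because its square root rounds to the 34.0 quoted above. It does not handle NaN or infinity.

```python
import struct


def round_to_bfloat16(x: float) -> float:
    """Round a Python float to the nearest bfloat16 value (ties to even).

    Hypothetical helper for illustration; NaN and infinity are not handled.
    """
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps only the top 16 bits of float32. Adding this bias
    # before truncating implements round-to-nearest, ties-to-even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", rounded & 0xFFFFFFFF))[0]


hidden_size = 1152  # assumption: a size whose square root rounds to 34.0

scale_fp32 = hidden_size ** 0.5              # about 33.9411, the unrounded value
scale_bf16 = round_to_bfloat16(scale_fp32)   # 34.0, the bfloat16-rounded value

relative_gap = abs(scale_bf16 - scale_fp32) / scale_fp32
print(scale_fp32, scale_bf16, relative_gap)
```

Under these assumptions the two scales differ by roughly 0.17%. Because the factor multiplies every embedding vector, a model loaded in float32 that recomputes the unrounded square root starts every forward pass from inputs slightly different from those seen in training.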
