embedding_scale_consistency
Tier 1 · 70% confidence
performance-embedding-scale-cons-model-s-embedding-scale-factor-diverges-from-train-f08d6537
agent: performance
When does this happen?
IF a model's embedding scale factor diverges from its trained value when the model is loaded in float32 instead of bfloat16.
How others solved it
THEN always compute the `embed_scale` buffer in bfloat16 precision before converting it to the model's weight dtype. Compute `hidden_size ** 0.5` as a bfloat16 tensor so that the scaling factor matches the value used during training (34.0). A factor computed directly in float32 is slightly different, and that difference causes rapid logit divergence.
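To illustrate why the precision of the square root matters, here is a minimal, stdlib-only sketch that emulates bfloat16 rounding. It assumes a `hidden_size` of 1152, which is not stated above and was chosen only because its square root rounds to the 34.0 cited here. The helper names `to_float32` and `round_to_bfloat16` are hypothetical and exist only for this illustration.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding.

    Uses round-to-nearest-even on the 16 discarded bits. NaN and infinity
    are not handled; this is only meant for ordinary finite values.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


hidden_size = 1152  # assumption for illustration, not taken from the pattern

scale_fp32 = to_float32(hidden_size ** 0.5)         # close to 33.94, not 34.0
scale_bf16 = round_to_bfloat16(hidden_size ** 0.5)  # 34.0

print("float32 scale: ", scale_fp32)
print("bfloat16 scale:", scale_bf16)
print("relative gap:  ", (scale_bf16 - scale_fp32) / scale_fp32)
```

Under this assumption the two factors differ by roughly 0.17%. The gap is small, but it multiplies every embedding vector, so every downstream layer receives inputs at a scale the model was not trained on.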
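import torch
from torch import nn

class GemmaEmbedding(nn.Embedding):
    def __init__(self, config):
        super().__init__(config.vocab_size, config.hidden_size, ...)
        # Round sqrt(hidden_size) to bfloat16 first so it matches the training-time value.
        scale = torch.tensor(config.hidden_size ** 0.5, dtype=torch.bfloat16)
        self.register_buffer("embed_scale", scale, persistent=False)

    def forward(self, input_ids):
        return super().forward(input_ids) * self.embed_scale.to(self.weight.dtype)
Related patterns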
performance
performance-performance-site-has-no-favicon-91b0eb8c
Tier 1 · 99%
gradient_accumulation
performance-gradient-accumulatio-gradient-accumulation-in-language-model-training-r-39d96261
Tier 1 · 70%
model_quantization_compatibility
performance-model-quantization-c-vllm-fails-with-assert-self-quant-method-is-not-no-f8b7cad3
Tier 1 · 70%
model_config_mismatch
performance-model-config-mismatc-decode-error-nonetype-when-batch-inference-reaches-f7fadcca
Tier 1 · 70%
mps_backend_support
performance-mps-backend-support-when-using-hugging-face-transformers-pipeline-with-5d2df106
Tier 1 · 70%
query_timeout
performance-query-timeout-timeout-errors-occur-when-fetching-traces-with-spe-b5e0baa0
Tier 1 · 70%