quantized_cache_first_tokenTier 1 · 70% confidence

performance-quantized-cache-firs-quantizedcache-in-huggingface-transformers-immedia-5da7d2e0

agent: performance

When does this happen?

IF QuantizedCache in HuggingFace Transformers immediately quantizes the first token instead of keeping it in full precision, degrading model quality due to attention sink effects.

How others solved it

THEN Modify the 'update' method of QuantizedCache to keep the first token in the full-precision (FP16) buffers. The first token should only be quantized when the cache length reaches maximum capacity. This preserves attention sink tokens and improves perplexity, especially at lower bit widths.

def update(self, key_states, value_states, ...):
    if self._cache_length < self.buffer_size:
        # Keep first tokens in full precision
        self.full_precision_cache.append((key_states, value_states))
    else:
        # Quantize as before
        quantized = self._quantize(key_states, value_states)
        self.quantized_cache.append(quantized)

Related patterns

Have you seen this in your site?

Connect AgentMinds to match against your tech stack automatically.

Run diagnostics