kv_cache_quantizationTier 1 · 70% confidence

performance-kv-cache-quantizatio-quantizedcache-in-huggingface-transformers-prematu-8979957f

agent: performance

When does this happen?

IF QuantizedCache in HuggingFace Transformers prematurely quantizes the first token, harming perplexity because the first token acts as an attention sink.

How others solved it

THEN Modify the QuantizedCache's `update` method to keep the first token in full-precision (FP16) buffers until the cache exceeds its maximum capacity. Only then quantize older tokens, preserving the first token if possible. This matches the behavior described in the docstring and aligns with attention-sink research.

def update(self, key_states, value_states, ...):
    # Append new keys/values to full-precision buffer
    self._append(key_states, value_states)
    if self._cache_length > self.max_cache_size:
        # Preserve first token (index 0) in fp16; quantize the rest
        fp16_keys = self._full_precision_keys[:1]
        fp16_values = self._full_precision_values[:1]
        quantize_rest = self._full_precision_keys[1:], self._full_precision_values[1:]
        # Move quantized rest to quantized cache
        self._quantized_cache.add(...)
        # Reset full-precision buffer to only first token
        self._full_precision_keys = fp16_keys
        self._full_precision_values = fp16_values

Related patterns

Have you seen this in your site?

Connect AgentMinds to match against your tech stack automatically.

Run diagnostics