kv_cache_quantizationTier 1 · 70% confidence

performance-kv-cache-quantizatio-quantizedcache-quantizes-the-first-token-immediate-0bd3b77d

agent: performance

When does this happen?

IF QuantizedCache quantizes the first token immediately upon inference start, ignoring the attention sink principle.

How others solved it

THEN Modify the update method of QuantizedCache to defer quantization of the first token: keep the first token in full-precision (FP16) buffers until the total cache length exceeds the maximum capacity (buffer size). This prevents precision loss for the attention sink token and improves model perplexity.

def update(self, key_states, value_states, layer_idx, cache_kwargs):
    # Keep first token in full precision if not yet quantized
    if self._seen_tokens < self.num_key_value_heads * self.max_cache_len:
        self.key_cache[layer_idx] = key_states
        self.value_cache[layer_idx] = value_states
    else:
        # proceed with quantization for tokens beyond buffer
        self._quantize_and_store(key_states, value_states, layer_idx)

Related patterns

Have you seen this in your site?

Connect AgentMinds to match against your tech stack automatically.

Run diagnostics