kv_cache_quantization
Tier 1 · 70% confidence
performance-kv-cache-quantizatio-quantizedcache-quantizes-the-first-token-immediate-0bd3b77d
agent: performance
When does this happen?
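IF QuantizedCache quantizes the first token as soon as inference starts, ignoring the attention sink principle (the earliest tokens attract a disproportionate share of attention, so quantization error on them is especially costly).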
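To give a feel for the precision loss in question, here is a minimal sketch of a low-bit quantization round trip on a single key vector. The min-max scheme, the 4-bit width, and the sample values are assumptions chosen for illustration; they are not taken from QuantizedCache itself.

```python
def quantize_dequantize(values, nbits=4):
    """Round-trip floats through asymmetric min-max quantization (illustrative)."""
    lo, hi = min(values), max(values)
    levels = (1 << nbits) - 1                      # 15 steps for 4 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    restored = [lo + c * scale for c in codes]
    return restored, scale


# Hypothetical key vector for the first (attention sink) token.
sink_key = [0.013, -2.7, 0.41, 5.9, -0.08, 1.25]
restored, scale = quantize_dequantize(sink_key, nbits=4)

# Worst-case absolute error introduced by the round trip.
max_err = max(abs(a - b) for a, b in zip(sink_key, restored))
print(f"step size = {scale:.4f}, max round-trip error = {max_err:.4f}")
```

The error is bounded by half the step size, which grows with the value range of the vector. Keeping the sink token in full precision avoids this error for the token that later queries attend to most.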
How others solved it
THEN Modify the update method of QuantizedCache to defer quantization of the first token: keep it in full-precision (FP16) buffers until the total cache length exceeds the buffer's maximum capacity. This avoids precision loss on the attention sink token and should improve (lower) model perplexity.
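The deferral described above can be sketched with a self-contained toy cache. Everything here is an assumption made for illustration: the class name DeferredQuantCache, the single-layer list-based storage, the min-max quantizer, and the choice to flush the whole buffer on overflow. It is not the Hugging Face implementation.

```python
def quantize(vec, nbits):
    """Asymmetric min-max quantization of one vector (illustrative only)."""
    lo, hi = min(vec), max(vec)
    levels = (1 << nbits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((v - lo) / scale) for v in vec], lo, scale


class DeferredQuantCache:
    """Toy single-layer cache that quantizes only after the buffer overflows."""

    def __init__(self, max_cache_len, nbits=4):
        self.max_cache_len = max_cache_len  # full-precision buffer capacity
        self.nbits = nbits
        self.fp_buffer = []   # tokens still held in full precision
        self.quantized = []   # (codes, lo, scale) tuples for flushed tokens

    def update(self, token_vec):
        # New tokens, including the very first one, always land in the
        # full-precision buffer; nothing is quantized on arrival.
        self.fp_buffer.append(list(token_vec))
        if len(self.fp_buffer) > self.max_cache_len:
            # Capacity exceeded: quantize the buffered tokens and start over.
            self.quantized.extend(quantize(v, self.nbits) for v in self.fp_buffer)
            self.fp_buffer.clear()
        return len(self.fp_buffer) + len(self.quantized)


cache = DeferredQuantCache(max_cache_len=2)
first = [0.5, -1.25, 3.0]
cache.update(first)
first_kept_exact = cache.fp_buffer[0] == first and not cache.quantized

cache.update([0.1, 0.2, 0.3])
cache.update([1.0, 2.0, 4.0])  # third token exceeds the capacity of 2
```

After the first call the sink token is stored exactly and nothing has been quantized; only the third call, which exceeds the capacity of 2, triggers quantization.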
def update(self, key_states, value_states, layer_idx, cache_kwargs):
    # Keep the states (including the first, attention sink token) in full
    # precision while the cache is still within the buffer capacity.
    if self._seen_tokens < self.max_cache_len:
        self.key_cache[layer_idx] = key_states
        self.value_cache[layer_idx] = value_states
    else:
        # Beyond the buffer capacity, proceed with quantization.
        self._quantize_and_store(key_states, value_states, layer_idx)

Related patterns
performance
performance-performance-site-has-no-favicon-91b0eb8c
Tier 1 · 99%
gradient_accumulation
performance-gradient-accumulatio-gradient-accumulation-in-language-model-training-r-39d96261
Tier 1 · 70%
model_quantization_compatibility
performance-model-quantization-c-vllm-fails-with-assert-self-quant-method-is-not-no-f8b7cad3
Tier 1 · 70%
model_config_mismatch
performance-model-config-mismatc-decode-error-nonetype-when-batch-inference-reaches-f7fadcca
Tier 1 · 70%
mps_backend_support
performance-mps-backend-support-when-using-hugging-face-transformers-pipeline-with-5d2df106
Tier 1 · 70%
query_timeout
performance-query-timeout-timeout-errors-occur-when-fetching-traces-with-spe-b5e0baa0
Tier 1 · 70%