kv_cache_quantization
Tier 1 · 70% confidence
performance-kv-cache-quantizatio-quantizedcache-in-huggingface-transformers-prematu-8979957f
agent: performance
When does this happen?
IF `QuantizedCache` in HuggingFace Transformers quantizes the first token prematurely, which harms perplexity because the first token acts as an attention sink.
How others solved it
THEN modify the `update` method of `QuantizedCache` so that the first token stays in the full-precision (FP16) buffers until the cache exceeds its maximum capacity. Only then quantize the older tokens, keeping the first token in full precision if possible. This matches the behavior described in the docstring and is consistent with attention-sink research.
def update(self, key_states, value_states, ...):
    # Append new keys/values to the full-precision buffer
    self._append(key_states, value_states)
    if self._cache_length > self.max_cache_size:
        # Preserve the first token (index 0) in fp16; quantize the rest
        fp16_keys = self._full_precision_keys[:1]
        fp16_values = self._full_precision_values[:1]
        quantize_rest = self._full_precision_keys[1:], self._full_precision_values[1:]
        # Move the quantized rest to the quantized cache
        self._quantized_cache.add(...)
        # Reset the full-precision buffer to only the first token
        self._full_precision_keys = fp16_keys
        self._full_precision_values = fp16_values
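The outline above leaves its helpers elided, so the following is a minimal, self-contained sketch of the same idea in plain Python rather than the actual Transformers implementation. Everything in it is an assumption made for illustration: the class name `SinkPreservingCache`, the `quantize` and `dequantize` helpers, the simple min/max quantization scheme, and the use of one list of floats per token in place of real key/value tensors. It only shows the control flow: the first token is stored separately and never quantized, and the remaining tokens are quantized only once the full-precision part exceeds `max_cache_size`.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vector = List[float]
Quantized = Tuple[List[int], float, float]  # (integer codes, scale, offset)


def quantize(vec: Vector, bits: int = 4) -> Quantized:
    """Hypothetical helper: min/max (affine) quantization of one vector."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant vectors
    codes = [round((x - lo) / scale) for x in vec]
    return codes, scale, lo


def dequantize(codes: List[int], scale: float, lo: float) -> Vector:
    """Hypothetical helper: inverse of `quantize`, up to rounding error."""
    return [c * scale + lo for c in codes]


@dataclass
class SinkPreservingCache:
    """Toy single-layer key cache; NOT the Transformers QuantizedCache."""
    max_cache_size: int  # full-precision capacity, counted in tokens
    bits: int = 4
    sink: List[Vector] = field(default_factory=list)          # first token, never quantized
    residual: List[Vector] = field(default_factory=list)      # recent tokens, full precision
    quantized: List[Quantized] = field(default_factory=list)  # older tokens, quantized

    def update(self, key: Vector) -> List[Vector]:
        # The very first token goes to the sink slot; later tokens to the residual buffer.
        if not self.sink:
            self.sink.append(key)
        else:
            self.residual.append(key)
        # Quantize only after the full-precision part exceeds capacity,
        # and leave the sink token untouched when doing so.
        if len(self.sink) + len(self.residual) > self.max_cache_size:
            self.quantized.extend(quantize(k, self.bits) for k in self.residual)
            self.residual = []
        return self.keys()

    def keys(self) -> List[Vector]:
        """All keys in token order: sink, dequantized older tokens, recent tokens."""
        older = [dequantize(*q) for q in self.quantized]
        return self.sink + older + self.residual


# Usage: with capacity 3, the fourth token triggers the first quantization pass.
tokens = [[0.11, -2.5, 7.3], [1.0, 2.0, 3.0], [0.5, 0.25, 0.125], [4.0, 5.0, 6.0]]
cache = SinkPreservingCache(max_cache_size=3)
for t in tokens:
    out = cache.update(t)
```

After the loop, `out[0]` is still the exact first token, while the other three entries are dequantized approximations. A real cache would apply the same split to both keys and values along the sequence dimension of each layer's tensors.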