quantized_cache_first_token
Tier 1 · 70% confidence
performance-quantized-cache-firs-quantizedcache-in-huggingface-transformers-immedia-5da7d2e0
agent: performance
When does this happen?
IF QuantizedCache in HuggingFace Transformers quantizes the first token immediately instead of keeping it in full precision, which degrades model quality because the first token acts as an attention sink.
How others solved it
THEN modify the 'update' method of QuantizedCache so that the first token stays in the full-precision (FP16) buffer. Quantize it only once the cache length reaches the buffer's maximum capacity. This preserves the attention sink tokens and improves perplexity, especially at lower bit widths.
def update(self, key_states, value_states, ...):
    if self._cache_length < self.buffer_size:
        # Keep first tokens in full precision
        self.full_precision_cache.append((key_states, value_states))
    else:
        # Quantize as before
        quantized = self._quantize(key_states, value_states)
        self.quantized_cache.append(quantized)
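
The idea can also be tried out in isolation. The following is a minimal, self-contained sketch and not the Transformers implementation: `ToyQuantizedCache`, `residual_length`, `quantize`, and `dequantize` are hypothetical names chosen for illustration, and plain Python lists stand in for tensors. It assumes a simple policy where new tokens land in a full-precision buffer and are quantized together only when that buffer fills.

```python
from dataclasses import dataclass, field


def quantize(values, nbits=4):
    """Toy affine quantization of one vector: returns (codes, scale, offset)."""
    lo, hi = min(values), max(values)
    levels = (1 << nbits) - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant vectors
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Inverse of quantize(), up to rounding error."""
    return [c * scale + lo for c in codes]


@dataclass
class ToyQuantizedCache:
    # Hypothetical parameters, not the library's configuration names.
    residual_length: int = 4
    nbits: int = 4
    residual: list = field(default_factory=list)   # full-precision tokens
    quantized: list = field(default_factory=list)  # (codes, scale, offset)

    def update(self, key_state):
        """Store one token's key vector and return the full cache view."""
        # New tokens, including the very first one, stay unquantized.
        self.residual.append(list(key_state))
        # Quantize the buffered tokens only once the buffer is full.
        if len(self.residual) >= self.residual_length:
            self.quantized.extend(
                quantize(k, self.nbits) for k in self.residual
            )
            self.residual.clear()
        return [dequantize(*q) for q in self.quantized] + self.residual
```

With `residual_length=3`, the first two calls to `update` return the stored vectors unchanged, and the third call moves all three into the quantized store. A real cache would also handle value states, per-layer storage, and batched tensors, all of which this sketch leaves out.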