R-KV: Redundancy-aware KV Cache Compression for Reasoning Models

Conference: NeurIPS'25

Github: https://github.com/Zefan-Cai/R-KV

My Thoughts

R-KV 针对previous works在efficient reasoning中由于只依靠attention scores而造成过多地保留了redundant tokens(重复而且对推理没有信息增益的token),R-KV的重点我觉得是增加了Redundancy Estimation via Semantic Similarity来解决这个问题,就Evaluation的结果来说，R-KV的效果是非常好的，即提升了Throughput还maintain了performance，甚至在budget充足的情况下还可以做到提点。

1. Motivation

Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference.

For instance, a DeepSeek-R1-Distill-Llama-8B model may generate 32K tokens to solve a complex math problem, consuming 15.5GB of memory to load model weights and 4.1GB to store the KV cache. This long CoT (chain-of-thought) generation necessitates the development of KV cache compression.

这种模型在推理时经常会产生极长的生成序列，其中包含大量冗余内容，如：

重复的自我反思（self-reflection）
多次重复的推理尝试
冗长的自对话式输出

这些部分虽然在语义上贡献有限，却显著增加了 KV 缓存长度和内存负担。

2. Challenge

While chain-of-thought inference significantly improves performance on complex reasoning tasks, existing KV cache compression methods often fail in this scenario.

原因： 传统压缩方法多基于 attention 重要性过滤（如按注意力权重筛选 token），但 reasoning 模型的输出往往包含高度重复的段落。由于这些重复段落之间互相强化注意力得分，简单基于注意力的裁剪策略会：

错误地保留冗余重复内容；
错误地删除离散但关键的 reasoning 片段。

最终导致推理链断裂或答案错误。

3. Contribution

We propose R-KV (Redundancy-aware KV Cache Compression) — a novel decoding-time KV compression method for reasoning LLMs.

核心创新包括三部分：

Attention-based importance scoring：根据注意力分数衡量每个 token 的上下文重要性。
Dynamic redundancy scoring：通过 Key 向量语义相似度动态估计 token 的冗余度。
Joint eviction mechanism：结合重要性与冗余度的联合打分机制，优化缓存利用率。

实验结果：

在保留仅 10–34% 的原始 KV cache 情况下，性能与未压缩模型几乎一致；
在 AIME-24 数据集上，仅用 16% KV cache 即实现 105% of FullKV accuracy；
方法训练无关、模型无关，可直接部署于 RL rollout 或推理服务。

4. Observation

4.1 Redundancy in Reasoning Models

观察：

多种 reasoning 模型（DeepSeek-R1-Distill-Llama-8B、DeepSeek-R1-Distill-Qwen-7B、Qwen-14B）生成的输出长度普遍比 ground truth 长 8 倍以上；
输出中 1-gram 与 2-gram 的平均重复频率显著高于参考答案，表明生成文本存在显著的冗余结构。

4.2 Failure of Existing KV Compression Methods

现有如 SnapKV 等 attention-based 方法倾向于保留重复的 reasoning 片段，例如重复的 “reflection” 或 “final answer” 段落，造成有效上下文被冗余信息挤占。

R-KV 的目标正是针对这种冗余失效模式进行修正。

5. Method

R-KV 的核心思想是在 decoding 阶段（即生成时）实时压缩 KV cache，动态选择最有用的历史 token，同时过滤掉冗余内容。

整个方法包括以下模块：

5.1 Decoding-time Compression

与大多数仅作用于 prefilling 阶段 的方法不同，R-KV 针对 decoding 阶段设计。

设定两个缓存区：
- Cache 区：存放保留的 KV 对，容量为 ( B_{\text{budget}} )
- Buffer 区：存放最新生成的 token，容量为 ( B_{\text{buffer}} )
总内存需求： $$ B_{\text{total}} = B_{\text{budget}} + B_{\text{buffer}} $$

压缩步骤（循环执行）：

模型生成一段长度为 ( B_{\text{buffer}} ) 的文本；
Buffer 的最后 (\alpha) 个 token（称作 observation tokens）始终保留；
拼接： $$ n = B_{\text{budget}} + B_{\text{buffer}} - \alpha $$ 形成候选集合；
对所有候选 token 计算综合得分；
选出前 (k = B_{\text{budget}} - \alpha) 个 token + (\alpha) 个 observation token 组成新的 cache。

5.2 Importance Scoring via Attention Weights

对每个 attention head (h)：

$$ A^{(h)} = \operatorname{softmax}!\left(\frac{Q^{(h)} (K^{(h)})^\top}{\sqrt{d}}\right) $$

其中：

( Q^{(h)} \in \mathbb{R}^{\alpha \times d} )：最近 (\alpha) 个 query；
( K^{(h)} \in \mathbb{R}^{n \times d} )：候选 key。

为了防止极端注意力值导致不稳定，在 token 维度上进行滑窗最大池化（window size = (2W)）：

$$ \tilde A^{(h)}{j,i} = \max(A^{(h)}{j,i-W},\ldots,A^{(h)}_{j,i+W-1}) $$

最后对 query 平均化得到 importance score：

$$ I^{(h)}i = \frac{1}{\alpha}\sum{j=0}^{\alpha-1}\tilde A^{(h)}_{j,i} $$

5.3 Redundancy Estimation via Semantic Similarity

目标：找出语义重复的 token。

归一化每个 key 向量： $$ \hat K^{(h)}_i = \frac{K^{(h)}_i}{|K^{(h)}_i|_2 + \varepsilon} $$
计算余弦相似度矩阵： $$ S^{(h)} = \hat K^{(h)} (\hat K^{(h)})^\top,\quad S^{(h)}_{i,i}=0 $$
为避免误删最近相关的重复 token，对每个 token (i)：
- 取与其相似度高于阈值 (T) 的集合 (\mathcal{I}^{(h)}i = {j | S^{(h)}{j,i} > T})
- 仅保留最近 (\beta) 个（即生成时间上靠近的），其他相似 token 设为 0： $$ S^{(h)}{j,i} = 0,\quad \forall j \in \mathcal{I}^{(h)}{i,\beta} $$
计算平均相似度并归一化： $$ \bar S^{(h)}i = \frac{1}{n}\sum{j=0}^{n-1} S^{(h)}_{j,i}, \quad R^{(h)}_i = \operatorname{softmax}(\bar S^{(h)}) $$

高相似度意味着冗余度高，应被淘汰。

5.4 Joint Selection Strategy

最终综合得分为：

$$ Z^{(h)}_i = \lambda I^{(h)}_i - (1 - \lambda) R^{(h)}_i $$

(\lambda)：控制重要性与冗余的平衡
- (\lambda \to 1)：接近 attention-only
- (\lambda \to 0)：接近 redundancy-only

论文实验表明：最佳范围 ( 0.01 \le \lambda \le 0.1 )，默认取 ( \lambda=0.1 )。

5.5 Aggregation and Selection

对所有 head 的结果取平均： $$ Z_i = \frac{1}{H}\sum_h Z^{(h)}_i $$

然后选出 top-(k = B_{\text{budget}} - \alpha) 个 token： $$ \text{Idx}_{\text{sel}} = \operatorname{TopK}(Z, k) $$

最终 cache： $$ K_{\text{new}} = [K_{\text{cand}}[\text{Idx}{\text{sel}}]; K{\text{obs}}] $$

6. Evaluation

6.1 Setup

Models:
- DeepSeek-R1-Distill-Llama-8B
- DeepSeek-R1-Distill-Qwen-14B
Datasets:
- MATH-500
- AIME-24 / AIME-25
Hyperparams:
- ( B_{\text{buffer}} = 128 ), ( \alpha = 8 ), ( \lambda = 0.1 )
Sampling: temperature = 0.6, top-p = 0.95
Metric: pass@1

6.2 Baselines

FullKV — 无压缩基线；
SnapKV — attention-only 压缩方法；
R-KV — 本文方法。

压缩周期：每生成 128 token 后执行一次。

6.3 Main Results

(1) DeepSeek-R1-Distill-Llama-8B

在 MATH-500 上：
- 34% KV cache 即可保持与 FullKV 几乎一致性能。
在 AIME-24 上：
- 10% KV cache 即能无损压缩；
- 16% KV cache 时性能甚至 超越 FullKV (105%)。

(2) DeepSeek-R1-Distill-Qwen-14B

在 MATH-500 上：lossless ratio 为 54%；
在 AIME-24 上：lossless ratio 为 25%；
在 33% cache 情况下达到 105% accuracy。

(3) 对比 SnapKV

SnapKV 往往错误保留大量重复反思段落； R-KV 通过冗余得分检测，能有效过滤此类内容，在所有预算条件下均优于 SnapKV。

6.4 Efficiency & Throughput

在 Llama3-8B 上：
- 压缩到 10% KV cache 时，内存减少约 90%；
- 吞吐提升可达 4.5–9×；
- 最大 batch size 提升 7–13×。
计算开销：虽然 redundancy 估计需要额外计算，但由于 attention 计算规模降低，整体推理速度仍提升。

6.5 Quantitative Summary

Model	Dataset	KV Retained	Accuracy (vs Full)	Speedup	Memory ↓
R1-Llama-8B	MATH-500	34%	≈100%	4.3×	72%
R1-Llama-8B	AIME-24	16%	105%	6.5×	84%
R1-Qwen-14B	AIME-24	33%	105%	7.2×	80%

7. Discussion

7.1 How to Choose λ

λ 控制重要性与冗余的权衡：

当 λ → 0，仅依赖冗余，可能误删关键信息；
当 λ → 1，仅依赖注意力，冗余压缩失效；
最佳区间： $$ 0.01 \le λ \le 0.1 $$
经验设置：
- reasoning 冗余较重时取小 λ；
- context 稳定性要求高时取稍大 λ。

7.2 Failure of Attention-Based Methods

SnapKV 等方法仅依据 attention，导致重复句子被反复保留。 R-KV 通过冗余矩阵有效识别这些重复项并删除。

关键区别： Attention ≠ Information Novelty 注意力表示“被关注”，不代表“信息增量”。

7.3 Efficiency Analysis

R-KV 压缩后 attention 矩阵规模下降，计算量从 (O(n^2 d)) 降到 (O(k n d))。
冗余计算为 (O(n^2 d))，但 n 远小于完整序列长度，整体收益显著。
实测表明：
- batch size ↑ 13.4×
- throughput ↑ 9.19×
- memory ↓ 90%

8. Conclusion

We introduced R-KV, a decoding-time KV cache compression method for reasoning LLMs.

通过联合建模 importance 与 redundancy， R-KV 在保留仅 10–34% 的 KV cache 下，性能与 FullKV 持平或超越。

其训练无关、模型无关、实现简单，可直接集成至 RL rollout 或推理服务，在长序列生成中带来高达 9× 吞吐提升、13× batch 扩展能力。

R-KV = 保性能 + 减内存 + 提吞吐 是 reasoning 型 LLM 推理优化的重要一步。

My Thoughts#

1. Motivation#

2. Challenge#

3. Contribution#

4. Observation#

4.1 Redundancy in Reasoning Models#

4.2 Failure of Existing KV Compression Methods#

5. Method#

5.1 Decoding-time Compression#

5.2 Importance Scoring via Attention Weights#

5.3 Redundancy Estimation via Semantic Similarity#

5.4 Joint Selection Strategy#

5.5 Aggregation and Selection#

6. Evaluation#

6.1 Setup#

6.2 Baselines#

6.3 Main Results#

(1) DeepSeek-R1-Distill-Llama-8B#

(2) DeepSeek-R1-Distill-Qwen-14B#

(3) 对比 SnapKV#

6.4 Efficiency & Throughput#

6.5 Quantitative Summary#

7. Discussion#

7.1 How to Choose λ#

7.2 Failure of Attention-Based Methods#

7.3 Efficiency Analysis#

8. Conclusion#