SPECVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning

Conference: EMNLP'25

Github: https://github.com/zju-jiyicheng/SpecVLM

1. Motivation

Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding.

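For example, LLaVA-OneVision (Li et al., 2024a) encodes each frame as 196 visual tokens. For a two-minute video at 60 FPS, the total exceeds 1 million tokens (7,200 frames × 196 = 1,411,200). This many video tokens causes:

- a sharp increase in sequence length;
- attention cost in the prefill stage that grows quadratically with sequence length;
- rapid growth of the KV cache during decoding, which becomes a major GPU memory bottleneck.

During autoregressive generation, every decoding step has to load the model parameters and the accumulated KV cache from GPU memory and store the new key/value entries, which makes decoding heavily memory-bound.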
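The scale of the problem can be checked with a few lines of arithmetic. The sketch below reproduces the token count from the example above and estimates a KV cache size; the decoder shape (28 layers, 4 KV heads, head dimension 128, fp16) is a hypothetical configuration chosen for illustration, not a figure from the paper.

```python
def video_token_count(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    """Total visual tokens if every frame of the video is encoded."""
    return int(duration_s * fps) * tokens_per_frame


def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size: one key and one value vector per layer, KV head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


tokens = video_token_count(duration_s=120, fps=60, tokens_per_frame=196)
print(tokens)  # 1411200

# Hypothetical decoder shape (an assumption, not taken from the paper).
size = kv_cache_bytes(tokens, num_layers=28, num_kv_heads=4, head_dim=128)
print(f"{size / 2**30:.1f} GiB")
```

Because the cache grows linearly with the token count, halving the number of video tokens halves this figure.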

2. Challenge

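To mitigate the compute and memory blow-up caused by the number of video tokens, recent work proposes various token pruning strategies: they identify redundant tokens or score token importance, then prune during the prefill stage to cut the cost of subsequent decoding.

However, removing tokens outright loses information, which is especially harmful for video understanding, where rich spatiotemporal cues are essential for high-quality generation. Pruning alone also yields only limited speedup, because every generation step still has to read the full model parameters.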

Speculative Decoding as a Solution

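Speculative decoding (SD) offers an alternative: a lightweight draft model first generates several candidate tokens, and the target model (the verifier) then verifies them in parallel. In theory, this substantially speeds up decoding without sacrificing generation quality (Leviathan et al., 2023).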
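The draft-then-verify loop can be illustrated with a minimal greedy sketch. This is a simplification under stated assumptions: it uses exact-match (greedy) verification rather than the rejection-sampling rule of Leviathan et al., and `toy_target` / `toy_draft` are hypothetical stand-ins for real models. A real verifier scores all γ positions in one forward pass; here each position is checked with a separate call for readability.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a context to its greedy next token


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], gamma: int) -> List[int]:
    """One draft-then-verify step; returns the tokens appended in this step."""
    # The draft model proposes gamma tokens autoregressively.
    proposal: List[int] = []
    for _ in range(gamma):
        proposal.append(draft(context + proposal))

    accepted: List[int] = []
    for tok in proposal:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: use the target's token and stop
            return accepted
        accepted.append(tok)
    # Every proposal matched: the verification pass also yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, gamma: int = 5) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out += speculative_step(target, draft, out, gamma)
    return out[:len(prompt) + max_new_tokens]


# Toy stand-ins (hypothetical): the draft disagrees with the target
# whenever the context length is a multiple of 4.
def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) + len(ctx)) % 7


def toy_draft(ctx: List[int]) -> int:
    return toy_target(ctx) if len(ctx) % 4 else (toy_target(ctx) + 1) % 7


print(generate(toy_target, toy_draft, [1, 2, 3], max_new_tokens=8))
```

Every emitted token equals what the target would have produced on its own, which is why the output matches plain greedy decoding with the target model; the draft only changes how many tokens each step yields.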

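Applying SD to Vid-LLMs, however, faces two main challenges:

- Linear growth of the draft model's KV cache: with long video inputs, the draft model's KV cache grows with the video length, and the draft model's latency becomes the main bottleneck.
- High redundancy and low information density of the video modality: existing SD methods for long contexts (Sun et al., 2024; Chen et al., 2025; Yang et al., 2025a) are modality-agnostic and cannot exploit the distinctive attention distribution over video tokens, so they perform poorly on video tasks.

The paper therefore proposes:

Applying video token pruning in the draft model shrinks its KV cache and thereby improves speculative decoding efficiency.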

2.1 Naive Speculative Decoding for Vid-LLMs

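Let:

- $M_t$: the target model (verifier);
- $M_d$: the draft model;
- $T_t$: the time for the target model to decode one token;
- $T_d$: the time for the draft model to decode one token;
- $T_t^\gamma$: the time for the target model to verify $\gamma$ tokens.

The total time of one speculative decoding step is then: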

$$ T_{\text{step}}^\gamma = \gamma \cdot T_d + T_t^\gamma $$

The average time per token is:

$$ T_{\text{token}}^\gamma = \frac{T_{\text{step}}^\gamma}{\tau} $$

where $\tau$ is the average accept length.

The speedup is therefore:

$$ \text{Speedup} = \frac{T_t}{T_{\text{token}}^\gamma} = \frac{\tau \cdot T_t}{\gamma \cdot T_d + T_t^\gamma} = \frac{\tau}{\gamma \cdot \frac{T_d}{T_t} + \frac{T_t^\gamma}{T_t}} $$

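Typically $T_t^\gamma / T_t \approx 1$, because memory-bound decoding makes verifying $\gamma$ tokens in one forward pass cost about the same as decoding a single token. The speedup is therefore determined mainly by the average accept length $\tau$ and the latency ratio $T_d/T_t$.

For video models, the draft model's KV cache grows quickly as the input gets longer, which increases $T_d$ and erodes the speedup from speculative decoding.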
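The speedup expression is easy to evaluate directly. The helper below is a small sketch of the formula in this section; the function name `sd_speedup` and the millisecond values in the example are made up for illustration.

```python
from typing import Optional


def sd_speedup(tau: float, gamma: int, t_draft: float, t_target: float,
               t_verify: Optional[float] = None) -> float:
    """Speedup = T_t / (T_step / tau), with T_step = gamma * T_d + T_t^gamma."""
    if t_verify is None:
        t_verify = t_target  # the approximation T_t^gamma ≈ T_t
    t_step = gamma * t_draft + t_verify
    t_token = t_step / tau
    return t_target / t_token


# Illustrative latencies in ms (hypothetical, not measurements).
print(sd_speedup(tau=3.5, gamma=5, t_draft=10.0, t_target=100.0))  # about 2.33
print(sd_speedup(tau=3.5, gamma=5, t_draft=30.0, t_target=100.0))  # about 1.40
```

With τ held fixed, tripling the draft latency in this toy setting cuts the speedup from about 2.33× to 1.40×, which is the effect a growing draft KV cache has on long video inputs.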

2.2 Speculation Sensitivity for Token Pruning

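To shrink the draft model's KV cache, the paper introduces video token pruning. The question is whether removing tokens, and with them visual information, lowers speculation accuracy.

In experiments with random token pruning on the VideoDetailCaption benchmark, the paper finds:

- when the pruning ratio is ≤ 50%, the average accept length τ is almost unchanged;
- in some cases moderate pruning even increases τ;
- when all video tokens are removed (100% pruning), τ and the overall speedup drop markedly.

This shows that:

Video inputs are highly redundant. Moderate pruning does not hurt speculation accuracy and can even remove distracting redundancy, helping the model focus.

However, random pruning is unstable at high ratios and degrades performance severely when key frames or key objects are removed.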
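For reference, the random pruning baseline discussed above can be sketched in a few lines. This is an assumed minimal form (uniform sampling without replacement over video token indices), not the paper's exact implementation; `random_prune` is a hypothetical name.

```python
import numpy as np


def random_prune(num_video_tokens: int, ratio: float, seed: int = 0) -> np.ndarray:
    """Sorted indices of the video tokens kept after dropping `ratio` of them at random."""
    keep = round((1.0 - ratio) * num_video_tokens)
    rng = np.random.default_rng(seed)
    kept = rng.choice(num_video_tokens, size=keep, replace=False)
    return np.sort(kept)  # sorting preserves the original temporal/spatial order


kept = random_prune(num_video_tokens=1960, ratio=0.9)
print(len(kept))  # 196
```

Nothing in this procedure looks at content, so at high ratios whether a key frame or object survives is a matter of chance, which is consistent with the instability noted above.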

3. Contribution

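SPECVLM proposes a verifier-guided, two-stage video token pruning strategy (staged video token pruning) that extends the speedup of speculative decoding to high pruning ratios.

Core idea:

- identify key video tokens using the target model's attention distribution (attention guidance);
- in high-attention regions, retain tokens by Top-P;
- in low-attention regions, apply spatially uniform downsampling;
- feed the pruned tokens to the draft model, which significantly shrinks its KV cache and improves speculative decoding efficiency.

Summary of main contributions:

- The first exploration of lossless speculative decoding acceleration for video LLMs. The paper identifies the "video token explosion" as the core cause of draft slowdown and proposes a targeted pruning scheme.
- The finding that the draft model is insensitive to random pruning (speculation insensitivity), which motivates the verifier-guided staged pruning strategy that keeps the acceptance rate high at high pruning ratios.
- Experimental results:
  - about 90% of speculation accuracy is retained after pruning 90% of the video tokens;
  - 2.68× speedup on LLaVA-OneVision;
  - 2.11× speedup on Qwen2.5-VL.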

4. Method

4.1 Attention-Guided Token Importance Estimation

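SPECVLM uses the target model's language-to-video attention to judge the importance of each video token.

Let:

- $L$: the set of language tokens;
- $V$: the set of video tokens;
- $G \in \mathbb{R}^{|L|\times|V|}$: the language-to-video attention matrix.

The importance score of each video token $j$ is defined as: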

$$ a_j = \frac{1}{|L|} \sum_{i=1}^{|L|} G_{i,j} $$

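That is, $a_j$ is the average attention that the language tokens pay to the $j$-th video token. In practice, the scores are averaged over all layers and heads to form the final attention map $A = \{a_j\}$.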
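The score computation above amounts to slicing the language-to-video block out of the attention weights and averaging. The NumPy sketch below assumes the attention weights are available as a dense `[layers, heads, seq, seq]` array with queries on the rows; the function name and the random test tensor are illustrative, not part of the released code.

```python
import numpy as np


def video_token_importance(attn: np.ndarray, lang_idx: np.ndarray,
                           video_idx: np.ndarray) -> np.ndarray:
    """attn: [layers, heads, seq, seq] attention weights (rows = queries, cols = keys)."""
    # Language-to-video block G for every layer and head: [layers, heads, |L|, |V|].
    g = attn[:, :, lang_idx][:, :, :, video_idx]
    # Average over layers, heads, and language queries: one score a_j per video token.
    return g.mean(axis=(0, 1, 2))


# Toy input: 2 layers, 3 heads, 4 video tokens followed by 2 language tokens.
rng = np.random.default_rng(0)
attn = rng.random((2, 3, 6, 6))
attn /= attn.sum(axis=-1, keepdims=True)  # make each row a probability distribution
video_idx = np.arange(0, 4)
lang_idx = np.arange(4, 6)

scores = video_token_importance(attn, lang_idx, video_idx)
print(scores.shape)  # (4,)
```

In a real model the dense attention matrix over a long video may be too large to materialize, so an implementation would likely compute only the language-query rows; that detail is outside this sketch.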

4.2 Two-Stage Video Token Pruning

SPECVLM proposes Two-Stage Token Pruning: (1) Top-P Retention; (2) Spatially Uniform Reduction.

Stage I — Top-P Retention

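The paper finds that video attention follows a long-tailed distribution: a few tokens receive most of the attention. Stage I therefore selects the high-attention tokens first.

Define a cumulative attention threshold $\lambda_r$ and find the smallest $c$ such that: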

$$ \frac{\sum_{i=1}^{c} a_{(i)}}{\sum_{j=1}^{|V|} a_j} \ge \lambda_r $$

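where $a_{(i)}$ is the $i$-th largest attention value.

These $c$ tokens are retained and form the set $V_R$.

$\lambda_r$ and the pruning ratio $r$ are determined on a small calibration set; for example, on LLaVA-OneVision $\lambda_r=0.4$ corresponds to $r=0.9$.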
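Stage I can be sketched as a sort followed by a cumulative sum. The function below is an illustrative reading of the rule above (smallest $c$ whose cumulative attention share reaches $\lambda_r$); the name `top_p_retain` and the toy scores are assumptions.

```python
import numpy as np


def top_p_retain(scores: np.ndarray, lam: float) -> np.ndarray:
    """Indices of the smallest set of top-scoring tokens whose attention share reaches `lam`."""
    order = np.argsort(scores)[::-1]               # token indices, highest attention first
    cum = np.cumsum(scores[order]) / scores.sum()  # cumulative share of total attention
    # searchsorted returns the first position where cum >= lam; c is that position + 1.
    c = int(np.searchsorted(cum, lam, side="left")) + 1
    c = min(c, len(scores))                        # guard against rounding in the last entry
    return np.sort(order[:c])


# Toy long-tailed scores: one token holds half of the attention mass.
toy_scores = np.array([50.0, 5.0, 30.0, 5.0, 10.0])
print(top_p_retain(toy_scores, lam=0.4))  # [0]
print(top_p_retain(toy_scores, lam=0.7))  # [0 2]
```

Because the distribution is long-tailed, a moderate threshold keeps only a handful of tokens, which leaves most of the budget to Stage II.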

Stage II — Spatially Uniform Reduction

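For the remaining tokens $V \setminus V_R$, attention values are close to one another and the tokens are spatially adjacent, so deleting them outright would break the spatiotemporal structure of the video. The paper therefore designs a spatially uniform sampling strategy:

With $|V| - |V_R|$ tokens remaining, the sampling interval is: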

$$ I = \frac{|V| - |V_R|}{(1-r)|V|} $$

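Sampling at this interval uniformly in space (for example, over each frame's 14×14 patch grid) yields the set $V_U$.

The final retained set is:

$$ V' = V_R \cup V_U $$

This preserves the semantically critical information while keeping the spatial structure intact.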
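A simplified sketch of Stage II and the final union follows. It strides through the flattened list of remaining tokens using the interval formula exactly as written in these notes, rather than sampling on the per-frame 2D patch grid, so it is an approximation of the spatial scheme. Note that with this interval $V_U$ alone holds about $(1-r)|V|$ tokens; whether the paper subtracts $|V_R|$ from that budget is not stated here. Function names are hypothetical.

```python
import numpy as np


def uniform_reduce(num_tokens: int, retained: np.ndarray, r: float) -> np.ndarray:
    """Stage II: pick evenly spaced tokens from those not kept by Stage I."""
    mask = np.ones(num_tokens, dtype=bool)
    mask[retained] = False
    remaining = np.flatnonzero(mask)                 # V \ V_R in original order
    budget = min(round((1.0 - r) * num_tokens), len(remaining))
    if budget <= 0:
        return np.empty(0, dtype=int)
    interval = len(remaining) / budget               # I = (|V| - |V_R|) / ((1 - r)|V|)
    picks = np.floor(np.arange(budget) * interval).astype(int)
    return remaining[picks]


def two_stage_keep(num_tokens: int, retained: np.ndarray, r: float) -> np.ndarray:
    """V' = V_R ∪ V_U as sorted token indices."""
    return np.union1d(retained, uniform_reduce(num_tokens, retained, r))


# Toy example: 1000 video tokens, Stage I kept the first 40 (hypothetical).
v_r = np.arange(40)
v_u = uniform_reduce(1000, v_r, r=0.9)
keep = two_stage_keep(1000, v_r, r=0.9)
print(len(v_u), len(keep))  # 100 140
```

Returning sorted indices keeps the surviving tokens in their original order, so the draft model still sees frames and patches in sequence.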

4.3 Why Verifier Guidance Works

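Because the verifier defines the reference distribution for the final output, its language-to-video attention directly reflects which video regions the language output depends on. This attention guidance is therefore a natural importance estimator and is more reliable than heuristic pruning.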

4.4 Complexity Analysis

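Suppose the original number of video tokens is $N$ and the pruning ratio is $r$. The draft model's KV cache for video tokens then shrinks to $(1-r)N$ entries, and the KV load/store cost of prefill and decode decreases approximately linearly.

The overall speculative step latency is:

$$ T_{\text{step}}^\gamma = \gamma \cdot T_d(r) + T_t^\gamma $$

Since $T_d(r) \propto (1-r)$, the theoretical speedup is:

$$ \text{Speedup} \approx \frac{\tau}{\gamma(1-r)\frac{T_d}{T_t}+1} $$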
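As a worked example of this approximation, take purely hypothetical values (not measurements from the paper): $\tau = 3$, $\gamma = 5$, and an unpruned latency ratio $T_d/T_t = 0.2$, with $\tau$ assumed unchanged by pruning.

$$ r = 0:\quad \text{Speedup} \approx \frac{3}{5 \cdot 1 \cdot 0.2 + 1} = \frac{3}{2} = 1.5 $$

$$ r = 0.9:\quad \text{Speedup} \approx \frac{3}{5 \cdot 0.1 \cdot 0.2 + 1} = \frac{3}{1.1} \approx 2.73 $$

This is an idealized upper bound: in practice $\tau$ drops slightly under pruning, and $T_d$ does not scale fully with $(1-r)$ because the draft model still reads its weights and the KV entries of the text tokens at every step.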

5. Evaluation

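5.1 Experimental Setup

Model families:
- LLaVA-OneVision (72B / 7B)
- Qwen2.5-VL (32B / 7B)
Settings:
- Standard SD (Std.-SD): a large target model paired with a small draft model;
- Self-SD: the same model drafts for itself on its own pruned input.
Benchmarks:
- VideoDetailCaption (LMMs-Lab, 2024)
- MVBench, MLVU, LongVideoBench
Input configuration:
- 64–128 sampled frames;
- 196 tokens per frame;
- default pruning ratio $r=0.9$.
Hardware and decoding:
- 8×A100 GPUs;
- speculation length γ=5;
- results averaged over 50 samples.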

5.2 Baselines

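| Baseline | Description |
| --- | --- |
| Vanilla | Standard autoregressive decoding |
| SD-Tree | EAGLE-style tree speculation |
| SD-Rand | SD + random token pruning |
| SD-Window / Frame / DyCoke / FastVID | Spatial/temporal redundancy pruning baselines |
| SD-Uniform | Spatially uniform sampling (no verifier guidance) |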

5.3 Main Results

LLaVA-OneVision (72B-7B, Std.-SD)

| Method | τ | Tokens/s | Speedup |
| --- | --- | --- | --- |
| Vanilla | – | 2.94 | 1.0× |
| SD-Tree | 3.57 | 6.41 | 2.18× |
| SD-Rand (r=0.9) | 3.19 | 7.36 | 2.50× |
| SPECVLM (r=0.9) | 3.48 | 7.88 | 2.68× |

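Takeaways:

- after pruning 90% of the tokens, SPECVLM still retains about 97% of the τ of unpruned SD-Tree (3.48 vs. 3.57);
- a smaller draft KV cache lowers prefill and decode latency;
- the overall speedup is higher than with random pruning (2.68× vs. 2.50×).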
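The derived columns in the LLaVA-OneVision table can be recomputed from the raw numbers as a quick consistency check. The snippet below only uses the τ and tokens/s values listed above; the variable names are illustrative.

```python
vanilla_tps = 2.94
rows = {
    "SD-Tree": (3.57, 6.41),   # (tau, tokens/s)
    "SD-Rand": (3.19, 7.36),
    "SPECVLM": (3.48, 7.88),
}

# Speedup is throughput relative to vanilla decoding.
speedup = {name: round(tps / vanilla_tps, 2) for name, (_, tps) in rows.items()}
# Tau retention is measured against unpruned SD-Tree.
tau_kept = {name: round(tau / rows["SD-Tree"][0], 3) for name, (tau, _) in rows.items()}

print(speedup)   # {'SD-Tree': 2.18, 'SD-Rand': 2.5, 'SPECVLM': 2.68}
print(tau_kept)  # {'SD-Tree': 1.0, 'SD-Rand': 0.894, 'SPECVLM': 0.975}
```

The ratios match the reported speedups, and the 0.975 figure is where the "about 97% of τ" statement comes from.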

Qwen2.5-VL (32B-7B, Std.-SD)

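| Method | τ | Tokens/s | Speedup |
| --- | --- | --- | --- |
| Vanilla | – | 2.56 | 1.0× |
| SPECVLM (r=0.9) | 3.25 | 5.40 | 2.11× |

This shows that SPECVLM delivers consistent speedups across different model architectures.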

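Self-SD Setting

When no separate draft model is available, SPECVLM still yields about 1.3× speedup under Self-SD. In this setting pruning only shrinks the model's own KV cache, and no extra model is needed.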

5.4 Scaling Law for Pruning Ratio

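Figure 6 of the paper plots τ as the pruning ratio $r$ varies:

- as $r$ goes from 0 to 0.9, τ drops far less for SPECVLM than for Random / Uniform pruning;
- SPECVLM remains stable even when $r$ exceeds 0.8;
- this indicates that the verifier-guided strategy preserves key information at very high compression ratios.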

5.5 Ablation Study

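- Spatial uniform only (no attention guidance): τ drops noticeably, especially on complex scenes, which shows that attention guidance is key to robustness.
- Top-P only (no uniform sampling): high-attention regions are kept but the spatial structure is fragmented, leading to semantic incoherence.
- Both stages (SPECVLM): balances semantics and spatial consistency and performs best at every pruning ratio.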

5.6 Latency Breakdown

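Table 5 shows the time breakdown (LLaVA-72B/7B, 256 output tokens):

| Stage | Vanilla | SPECVLM |
| --- | --- | --- |
| Prefill (draft) | 24.2s | 9.6s |
| Draft decode | 28.9s | 12.5s |
| Target verify | 57.9s | 35.2s |
| Total | 111s | 57s |

Shrinking the draft KV cache directly reduces draft prefill and decode latency, roughly halving total inference time (111s to 57s).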

5.7 Early-step Accept Length Stability

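The paper further studies how τ is distributed over the decoding process (Table 3):

- τ over the first 10 steps is almost identical to the overall average τ;
- this shows that SPECVLM pruning does not cause early speculation to fail.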

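5.8 Visualization

The left panel shows that the attention heatmap after SPECVLM's token retention still aligns well with that of the original model; the middle panel shows that τ is stable across different $r$; the right panel shows the trade-off curve between speedup and τ.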

6. Limitation

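- Best suited to long-video, resource-constrained settings: the gains are largest when GPU memory bandwidth is the main bottleneck.
- Requires an additional draft model: although the overhead is relatively small, a suitable draft model still has to be chosen.
- The training-free design caps the maximum speedup: training a more lightweight Vid-LLM draft model in the future could further improve performance.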

7. Conclusion

We propose SPECVLM, the first training-free speculative decoding framework tailored for accelerating video LLMs. Building on the low speculation sensitivity to token pruning, SPECVLM leverages verifier-guided attention to remove redundant video tokens, significantly reducing the draft model’s KV cache without compromising generation quality.

SPECVLM achieves:

2.68× speedup on LLaVA-OneVision-72B
2.11× speedup on Qwen2.5-VL-32B

It provides a general, plug-and-play, training-free acceleration framework for long video reasoning.
