SpinQuant: LLM Quantization with Learned Rotations

November 6, 2025

ViSpec - Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

November 5, 2025

SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs

November 3, 2025

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

November 1, 2025

Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning

October 31, 2025

QSVD: Efficient Low-rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models

October 30, 2025

SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning

October 29, 2025

AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders

October 27, 2025

R-KV: Redundancy-aware KV Cache Compression for Reasoning Models

October 25, 2025

Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs

October 24, 2025