πŸ“ Empirical Recipes for Efficient and Compact Vision-Language Models

March 24, 2026

πŸ“ Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

March 23, 2026

πŸ“ PRUNE REDUNDANCY, PRESERVE ESSENCE: VISION TOKEN COMPRESSION IN VLMS VIA SYNERGISTIC IMPORTANCE–DIVERSITY

March 1, 2026

πŸ“ MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs

February 28, 2026

πŸ“ ONE PATCH DOESN’T FIT ALL: ADAPTIVE PATCHING FOR NATIVE-RESOLUTION MULTIMODAL LARGE LANGUAGE MODELS

February 2, 2026

πŸ“ Matryoshka Multimodal Models

January 30, 2026

πŸ“ Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models

January 22, 2026

πŸ“ TALON: Confidence-Aware Speculative Decoding with Adaptive Token Trees

January 21, 2026

πŸƒβ€ Set a new half marathon PB in training

January 16, 2026

πŸ“ VLCACHE: Computing 2% Vision Tokens and Reusing 98% for Vision–Language Inference

January 6, 2026

πŸ“ Accelerating Streaming Video Large Language Models via Hierarchical Token Compression

December 27, 2025

πŸ“ HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices

December 23, 2025

πŸ“ VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

December 21, 2025

πŸ“ SpecBranch: Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism

December 20, 2025

πŸ“ Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter

December 20, 2025

πŸ“ Empower Vision Applications with LoRA LMM

December 15, 2025

πŸ“ Venus: An Efficient Edge Memory-and-Retrieval System for VLM-based Online Video Understanding

December 12, 2025

πŸ“ Elastic On-Device LLM Service

December 8, 2025

πŸ“ RServe: Overlapping Encoding and Prefill for Efficient LMM Inference

December 6, 2025

πŸ“ ModServe - Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving

November 27, 2025

πŸ“ Speculate Deep and Accurate - Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding

November 27, 2025

πŸ“ Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices

November 26, 2025

πŸƒβ€ Finished the 2025 Humen Half Marathon

November 23, 2025

πŸ“ KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction

November 21, 2025

πŸ“ SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications

November 18, 2025

πŸ“ Breaking the Wall: Unifying Edge GPUs and NPUs into Pipeline Parallelism for Efficient LLM Fine-Tuning

November 17, 2025

πŸ“ TINY BUT MIGHTY: A SOFTWARE-HARDWARE CO DESIGN APPROACH FOR EFFICIENT MULTIMODAL IN FERENCE ON BATTERY-POWERED SMALL DEVICES

November 17, 2025

πŸ“ Efficiently Serving Large Multimodal Models Using EPD Disaggregation

November 15, 2025

πŸ“ ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism

November 13, 2025

πŸ“ ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference

November 12, 2025

πŸ“ Stop Looking for β€œImportant Tokens” in Multimodal Language Models: Duplication Matters More

November 12, 2025

πŸ“ MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders

November 10, 2025

πŸ“ DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization

November 9, 2025

πŸ“ HEADINFER: Memory-Efficient LLM Inference by Head-wise Offloading

November 9, 2025

πŸ“ Efficient Multi-modal Large Language Models via Progressive Consistency Distillation

November 7, 2025

πŸ“ SPINQUANT: LLM QUANTIZATION WITH LEARNED ROTATIONS

November 6, 2025

πŸ“ ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

November 5, 2025

πŸ“ SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs

November 3, 2025

πŸ“ Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

November 1, 2025

πŸ“ NOT ALL HEADS MATTER: A HEAD-LEVEL KV CACHE COMPRESSION METHOD WITH INTEGRATED RETRIEVAL AND REASONING

October 31, 2025

πŸ“ QSVD: Efficient Low-rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models

October 30, 2025

πŸ“ SPECVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning

October 29, 2025

πŸ“ AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders

October 27, 2025

πŸ“ R-KV: Redundancy-aware KV Cache Compression for Reasoning Models

October 25, 2025

πŸ“ Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs

October 24, 2025

πŸ“ Cache-to-Cache: Direct Semantic Communication Between Large Language Models

October 15, 2025