πŸ“ Empirical Recipes for Efficient and Compact Vision-Language Models

March 24, 2026

πŸ“ Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

March 23, 2026

πŸ“ PRUNE REDUNDANCY, PRESERVE ESSENCE: VISION TOKEN COMPRESSION IN VLMS VIA SYNERGISTIC IMPORTANCE–DIVERSITY

March 1, 2026

πŸ“ MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs

February 28, 2026

πŸ“ ONE PATCH DOESN’T FIT ALL: ADAPTIVE PATCHING FOR NATIVE-RESOLUTION MULTIMODAL LARGE LANGUAGE MODELS

February 2, 2026

πŸ“ Matryoshka Multimodal Models

January 30, 2026

πŸ“ Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models

January 22, 2026

πŸ“ TALON: Confidence-Aware Speculative Decoding with Adaptive Token Trees

January 21, 2026

πŸƒβ€ Set a new half marathon PB in training

January 16, 2026

πŸ“ VLCACHE: Computing 2% Vision Tokens and Reusing 98% for Vision–Language Inference

January 6, 2026

πŸ“ Accelerating Streaming Video Large Language Models via Hierarchical Token Compression

December 27, 2025

πŸ“ HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices

December 23, 2025

πŸ“ VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

December 21, 2025

πŸ“ SpecBranch: Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism

December 20, 2025

πŸ“ Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter

December 20, 2025

πŸ“ Empower Vision Applications with LoRA LMM

December 15, 2025

πŸ“ Venus: An Efficient Edge Memory-and-Retrieval System for VLM-based Online Video Understanding

December 12, 2025

πŸ“ Elastic On-Device LLM Service

December 8, 2025

πŸ“ RServe: Overlapping Encoding and Prefill for Efficient LMM Inference

December 6, 2025

πŸ“ ModServe - Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving

November 27, 2025

πŸ“ Speculate Deep and Accurate - Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding

November 27, 2025

πŸ“ Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices

November 26, 2025

πŸƒβ€ Finished the 2025 Humen Half Marathon

November 23, 2025

πŸ“ KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction

November 21, 2025

πŸ“ SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications

November 18, 2025

πŸ“ Breaking the Wall: Unifying Edge GPUs and NPUs into Pipeline Parallelism for Efficient LLM Fine-Tuning

November 17, 2025

πŸ“ TINY BUT MIGHTY: A SOFTWARE-HARDWARE CO DESIGN APPROACH FOR EFFICIENT MULTIMODAL IN FERENCE ON BATTERY-POWERED SMALL DEVICES

November 17, 2025

πŸ“ Efficiently Serving Large Multimodal Models Using EPD Disaggregation

November 15, 2025

πŸ“ ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism

November 13, 2025

πŸ“ ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference

November 12, 2025

πŸ“ Stop Looking for β€œImportant Tokens” in Multimodal Language Models: Duplication Matters More

November 12, 2025

πŸ“ MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders

November 10, 2025

πŸ“ DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization

November 9, 2025

πŸ“ HEADINFER: Memory-Efficient LLM Inference by Head-wise Offloading

November 9, 2025

πŸ“ Efficient Multi-modal Large Language Models via Progressive Consistency Distillation

November 7, 2025

πŸ“ SPINQUANT: LLM QUANTIZATION WITH LEARNED ROTATIONS

November 6, 2025

πŸ“ ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

November 5, 2025

πŸ“ SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs

November 3, 2025

πŸ“ Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

November 1, 2025

πŸ“ NOT ALL HEADS MATTER: A HEAD-LEVEL KV CACHE COMPRESSION METHOD WITH INTEGRATED RETRIEVAL AND REASONING

October 31, 2025

πŸ“ QSVD: Efficient Low-rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models

October 30, 2025

πŸ“ SPECVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning

October 29, 2025

πŸ“ AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders

October 27, 2025

πŸ“ R-KV: Redundancy-aware KV Cache Compression for Reasoning Models

October 25, 2025

πŸ“ Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs

October 24, 2025

πŸ“ Cache-to-Cache: Direct Semantic Communication Between Large Language Models

October 15, 2025