π PRUNE REDUNDANCY, PRESERVE ESSENCE: VISION TOKEN COMPRESSION IN VLMS VIA SYNERGISTIC IMPORTANCE-DIVERSITY
π ONE PATCH DOESN'T FIT ALL: ADAPTIVE PATCHING FOR NATIVE-RESOLUTION MULTIMODAL LARGE LANGUAGE MODELS
π Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
π ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving
π Speculate Deep and Accurate: Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding
π Breaking the Wall: Unifying Edge GPUs and NPUs into Pipeline Parallelism for Efficient LLM Fine-Tuning
π TINY BUT MIGHTY: A SOFTWARE-HARDWARE CO-DESIGN APPROACH FOR EFFICIENT MULTIMODAL INFERENCE ON BATTERY-POWERED SMALL DEVICES
π NOT ALL HEADS MATTER: A HEAD-LEVEL KV CACHE COMPRESSION METHOD WITH INTEGRATED RETRIEVAL AND REASONING
π QSVD: Efficient Low-rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models