LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit
Chengtao Lv, Bilang Zhang, Yang Yong, Ruihao Gong, Yushi Huang, Shiqiao Gu, Jiajun Wu, Yumeng Shi, Jinyang Guo, Wenya Wang
TL;DR
LLMC+ introduces a plug-and-play benchmark for Vision-Language Model (VLM) compression, enabling fair, modular evaluation of token-level and model-level techniques across multiple VLM families. It provides a comprehensive taxonomy of spatial and temporal redundancy and demonstrates that strategies must be tailored to both Vision Tower and LLM components, with practical evaluation on multi-turn dialogue and fine-grained tasks. The study shows that token reduction alone often degrades performance in realistic scenarios, whereas combining token reduction with post-training quantization yields strong compression with minimal accuracy loss. It also reports that real-world speedups and memory savings are achievable on consumer hardware. Overall, LLMC+ offers actionable guidelines and a framework to drive fair assessment and development of efficient VLMs for deployment.
Abstract
Large Vision-Language Models (VLMs) exhibit impressive multi-modal capabilities but suffer from prohibitive computational and memory demands due to their long visual token sequences and massive parameter sizes. To address these issues, recent works have proposed training-free compression methods. However, existing efforts often suffer from three major limitations: (1) Current approaches do not decompose techniques into comparable modules, hindering fair evaluation across spatial and temporal redundancy. (2) Evaluation is confined to simple single-turn tasks, failing to reflect performance in realistic scenarios. (3) Individual compression techniques are used in isolation, without exploring their joint potential. To overcome these gaps, we introduce LLMC+, a comprehensive VLM compression benchmark with a versatile, plug-and-play toolkit. LLMC+ supports over 20 algorithms across five representative VLM families and enables systematic study of token-level and model-level compression. Our benchmark reveals that: (1) Spatial and temporal redundancies demand distinct technical strategies. (2) Token reduction methods degrade significantly in multi-turn dialogue and detail-sensitive tasks. (3) Combining token and model compression achieves extreme compression with minimal performance loss. We believe LLMC+ will facilitate fair evaluation and inspire future research in efficient VLMs. Our code is available at https://github.com/ModelTC/LightCompress.
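
To make the idea of combining token-level and model-level compression concrete, the following is a minimal, illustrative sketch in plain Python. It is not the LLMC+ toolkit API: the function names (`prune_tokens`, `quantize_int8`, `dequantize`) are hypothetical, the importance scores are assumed to come from some external signal such as attention weights, and symmetric per-tensor int8 rounding is used here only as a simple stand-in for a post-training quantization method.

```python
import math

def prune_tokens(tokens, scores, keep_ratio):
    """Token-level compression (hypothetical helper): keep the
    highest-scoring fraction of visual tokens, preserving their order."""
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    # Indices of the n_keep largest scores.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:n_keep]
    # Re-sort the kept indices so the sequence order is unchanged.
    return [tokens[i] for i in sorted(top)]

def quantize_int8(weights):
    """Model-level compression (hypothetical helper): symmetric
    per-tensor int8 quantization. Returns (integer codes, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

# Toy inputs: four visual tokens with assumed importance scores,
# and a tiny weight vector standing in for a model layer.
visual_tokens = ["t0", "t1", "t2", "t3"]
importance = [0.10, 0.70, 0.05, 0.15]
kept = prune_tokens(visual_tokens, importance, keep_ratio=0.5)

weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
```

In this toy setup the two steps are independent: pruning shortens the visual token sequence, while quantization shrinks the stored weights, which is why the two kinds of compression can in principle be stacked.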
