LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit

Chengtao Lv, Bilang Zhang, Yang Yong, Ruihao Gong, Yushi Huang, Shiqiao Gu, Jiajun Wu, Yumeng Shi, Jinyang Guo, Wenya Wang

TL;DR

LLMC+ introduces a plug-and-play benchmark for Vision-Language Model compression, enabling fair, modular evaluation of token-level and model-level techniques across multiple VLM families. It provides a comprehensive taxonomy of spatial and temporal redundancy and demonstrates that strategies must be tailored to both Vision Tower and LLM components, with practical evaluation on multi-turn dialogue and fine-grained tasks. The study shows that token reduction alone often degrades performance in realistic scenarios, but combining token reduction with post-training quantization yields strong compression with minimal accuracy loss, and real-world speedups and memory savings are achievable on consumer hardware. Overall, LLMC+ offers actionable guidelines and a framework to drive fair assessment and development of efficient VLMs for deployment.

Abstract

Large Vision-Language Models (VLMs) exhibit impressive multi-modal capabilities but suffer from prohibitive computational and memory demands due to their long visual token sequences and massive parameter sizes. To address these issues, recent works have proposed training-free compression methods. However, existing efforts often suffer from three major limitations: (1) Current approaches do not decompose techniques into comparable modules, hindering fair evaluation across spatial and temporal redundancy. (2) Evaluation is confined to simple single-turn tasks, failing to reflect performance in realistic scenarios. (3) Individual compression techniques are used in isolation, without exploring their joint potential. To overcome these gaps, we introduce LLMC+, a comprehensive VLM compression benchmark with a versatile, plug-and-play toolkit. LLMC+ supports over 20 algorithms across five representative VLM families and enables systematic study of token-level and model-level compression. Our benchmark reveals that: (1) Spatial and temporal redundancies demand distinct technical strategies. (2) Token reduction methods degrade significantly in multi-turn dialogue and detail-sensitive tasks. (3) Combining token and model compression achieves extreme compression with minimal performance loss. We believe LLMC+ will facilitate fair evaluation and inspire future research in efficient VLMs. Our code is available at https://github.com/ModelTC/LightCompress.
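To make the third finding concrete, the following is a minimal sketch of what combining token-level and model-level compression can look like in principle. It is not the LLMC+ toolkit's API: the function names (`prune_visual_tokens`, `quantize_row`, `dequantize_row`) are hypothetical, the per-token importance scores are assumed to be given (for example, derived from attention), and symmetric round-to-nearest quantization stands in for whichever post-training quantization algorithm is actually used.

```python
from typing import List, Sequence, Tuple


def prune_visual_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Token-level compression (illustrative): return the indices of the
    highest-scoring visual tokens, in their original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    keep = max(1, int(len(scores) * keep_ratio))
    # Rank token indices by importance score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept sequence stays coherent.
    return sorted(ranked[:keep])


def quantize_row(row: Sequence[float], bits: int = 8) -> Tuple[List[int], float]:
    """Model-level compression (illustrative): symmetric round-to-nearest
    quantization of one weight row with a single scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q: Sequence[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [v * scale for v in q]


if __name__ == "__main__":
    # Hypothetical importance scores for 8 visual tokens.
    scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.2, 0.7, 0.4]
    kept = prune_visual_tokens(scores, keep_ratio=0.5)
    print("kept token indices:", kept)

    # Hypothetical weight row, quantized to 4 bits.
    weights = [0.5, -1.0, 0.25, 0.0]
    codes, scale = quantize_row(weights, bits=4)
    print("codes:", codes, "reconstructed:", dequantize_row(codes, scale))
```

The two steps are independent: pruning shortens the visual token sequence the LLM has to process, while quantization shrinks the weights that process it, which is why the two kinds of savings can be stacked.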

Paper Structure

This paper contains 23 sections, 1 equation, 6 figures, 18 tables.

Figures (6)

  • Figure 1: Illustration of our proposed powerful toolkit, LLMC+. Due to its high flexibility and versatility, we build a VLM compression benchmark upon it and conduct an in-depth analysis.
  • Figure 2: The two-step pipeline for removing temporal redundancy.
  • Figure 3: Real inference efficiency on LLaVA-NeXT [liu2024llavanext].
  • Figure 4: Qualitative results on the GQA [hudson2019gqa] benchmark with LLaVA-1.5-7B [liu2023visual].
  • Figure 5: Visual token similarities in LLaVA-OneVision [li2024llava].
  • ...and 1 more figure