Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models

Chen Ju, Haicheng Wang, Haozhe Cheng, Xu Chen, Zhonghua Zhai, Weilin Huang, Jinsong Lan, Shuai Xiao, Bo Zheng

TL;DR

Turbo introduces a data-centric approach to accelerating vision-language large models by pruning tokens via an information-degree metric that jointly accounts for mutual redundancy $\mathcal{R}$ and semantic value $\mathcal{A}$. Implemented as a training-free plug-in, Turbo merges or restores tokens in a way that preserves the informativity $\mathcal{I}(\mathbf{X}) = -\log \mathbb{P}(\mathbf{X})$ while reducing sequence length, with different strategies for understanding and generation tasks. Empirical results across multiple datasets and backbones show Turbo achieves substantial throughput gains (approximately $2\times$ for understanding and $1.6\times$ for generation) with negligible performance loss, and it remains orthogonal to model-centric accelerations like GPTQ and UPop. Theoretical analyses and ablations corroborate the effectiveness of combining mutual redundancy with semantic value, and visualizations illustrate that Turbo largely preserves foreground semantic information while compressing background content. Turbo thus offers a universal, plug-in data-acceleration solution with broad applicability to VLMs.

Abstract

Vision-Language Large Models (VLMs) have recently become a primary backbone of AI thanks to their impressive performance. However, their high computation costs, reflected in low throughput and high latency, limit their potential in real-world scenarios. To accelerate VLMs, most existing methods take the model perspective (pruning, distillation, quantization) and overlook redundancy on the data side. To fill this gap, this paper highlights the severity of data redundancy and designs a plug-and-play Turbo module, guided by information degree, that prunes inefficient tokens from visual or textual data. To balance efficiency against performance, information degree considers two factors: mutual redundancy and semantic value. The former measures duplication between sequential tokens, while the latter scores each token by its contribution to the overall semantics. Tokens with a high information degree therefore carry less redundancy and stronger semantics. During VLM computation, Turbo works as a user-friendly plug-in that sorts data by information degree and uses only the top-ranked tokens to save cost. Its advantages include broad compatibility with various VLMs across understanding and generation, and simple use with no re-training and only trivial engineering effort. Experiments on multiple VLM benchmarks demonstrate that Turbo delivers good acceleration with a negligible performance drop.
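
To make the "sort by information degree, keep the top tokens" idea concrete, here is a minimal pure-Python sketch. It is not the paper's implementation. Several choices are assumptions made only for illustration: the scoring rule (semantic value minus `alpha` times mutual redundancy), the use of maximum cosine similarity to any other token as the redundancy measure, and the function names `information_degree` and `keep_top_tokens`. The abstract does not give the exact formula, and the TL;DR notes that Turbo merges or restores tokens rather than simply discarding them, which this sketch leaves out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def information_degree(tokens, semantic_value, alpha=1.0):
    """Hypothetical score per token: semantic value minus alpha * redundancy.

    tokens         -- list of embedding vectors
    semantic_value -- one score per token (assumed given, e.g. an attention weight)
    alpha          -- balancing coefficient between the two factors (assumed form)
    """
    scores = []
    for i, tok in enumerate(tokens):
        # Assumption: redundancy = highest similarity to any *other* token.
        redundancy = max(
            (cosine(tok, other) for j, other in enumerate(tokens) if j != i),
            default=0.0,
        )
        scores.append(semantic_value[i] - alpha * redundancy)
    return scores


def keep_top_tokens(tokens, semantic_value, drop_ratio, alpha=1.0):
    """Return the indices of the kept tokens, in their original order."""
    scores = information_degree(tokens, semantic_value, alpha)
    n_keep = len(tokens) - int(len(tokens) * drop_ratio)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    # Two pairs of duplicate embeddings; within each pair one token has the
    # higher semantic value, so that one should survive a 50% drop.
    tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    semantic_value = [0.9, 0.1, 0.8, 0.2]
    print(keep_top_tokens(tokens, semantic_value, drop_ratio=0.5))
```

In the toy input every token has an exact duplicate, so the redundancy term is identical across tokens and the semantic value decides which copy of each pair is kept. With less uniform embeddings the `alpha` term would shift the ranking toward tokens that are unlike the rest of the sequence.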

Paper Structure

This paper contains 22 sections, 24 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Left: the main obstacle to applying VLMs is their high cost. Right: to accelerate VLMs, most existing methods focus on the model perspective (pruning and quantization), whereas Turbo explores de-redundancy from the data perspective.
  • Figure 2: Computing Architecture. As a plug-in module, Turbo compresses data to cut computing overhead for various VLMs, across understanding/generation and uni-/multi-modality. For understanding tasks, it sorts and then merges tokens by information degree (mutual redundancy $\mathcal{R}$ and semantic value $\mathcal{A}$); for generation tasks, it sorts, merges, and restores the VLMs' tokens, giving it good universality and practicality.
  • Figure 3: Empirical Evaluation of Token Redundancy & Attention Concentration on BLIP fine-tuned for multi-modal retrieval. The results reveal non-negligible redundancy in the token sequence from the perspectives of both semantics and similarity.
  • Figure 4: Ablation Study on Drop Ratio $\Upsilon$. Semantic value retains superior performance when $\Upsilon$ is small, whereas mutual redundancy is more stable when $\Upsilon$ is large. By combining the two components, Turbo obtains competitive results and stability across the whole range.
  • Figure 5: Ablation Study of Balancing Coefficient $\alpha$. On image captioning using BLIP (ViT-Base and ViT-Large), performance varies only slightly, which demonstrates robustness to $\alpha$.
  • ...and 3 more figures