Sparsity Forcing: Reinforcing Token Sparsity of MLLMs

Feng Chen, Yefei He, Lequan Lin, Chenhui Gou, Jing Liu, Bohan Zhuang, Qi Wu

TL;DR

This paper explicitly reinforces token sparsity in well-posed multimodal large language models (MLLMs) through a simple RL-based post-training framework named Sparsity Forcing, which explores the efficiency-accuracy trade-off by running multiple rollouts with different token budgets and turns token saving into an end-to-end, inference-consistent optimization objective.

Abstract

Sparse attention mechanisms aim to reduce computational overhead with minimal accuracy loss by selectively processing salient tokens. Despite their effectiveness, most methods merely exploit a model's inherent sparsity and thus plateau at moderate budgets (about 50% token reduction), with little headroom to push budgets lower without hurting accuracy. Other approaches attempt to enforce sparsity through trainable sparse attention or sharpness-inducing regularizers, but these either fix rigid patterns that ignore input and layer dynamics, or optimize proxy objectives without direct control over token budgets. In this paper, we explicitly reinforce token sparsity in well-posed multimodal large language models (MLLMs) through a simple RL-based post-training framework named Sparsity Forcing. Our method explores the efficiency-accuracy trade-off by running multiple rollouts with different token budgets, where both efficiency (token reduction ratio) and performance (answer correctness) are formulated as joint rewards. By contrasting rollouts within each group, the more efficient and correct answer is rewarded while less efficient or incorrect ones are penalized, thereby turning token saving into an end-to-end, inference-consistent optimization objective. Across thirteen image and video benchmarks, Sparsity Forcing raises the token reduction ratio on Qwen2-VL/Qwen2.5-VL from 20% to 75% with minimal accuracy decline, reducing long-context inference memory by up to 3$\times$ while speeding up decoding by up to 3.3$\times$.
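
The group-contrast idea in the abstract can be illustrated with a small sketch. The abstract does not give the exact reward formula, so everything below is an assumption made for illustration: the reward shape (a correctness term plus a bonus proportional to the token reduction ratio, credited only when the answer is correct), the `efficiency_weight` parameter, and the mean/standard-deviation normalization within a group.

```python
from statistics import mean, pstdev


def joint_reward(correct: bool, token_ratio: float, efficiency_weight: float = 1.0) -> float:
    """Hypothetical joint reward (assumed form, not the paper's formula).

    token_ratio is the fraction of tokens kept, so (1 - token_ratio) is the
    token reduction ratio. Efficiency is only credited for correct answers.
    """
    if not correct:
        return 0.0
    return 1.0 + efficiency_weight * (1.0 - token_ratio)


def group_advantages(rewards, eps: float = 1e-6):
    """Contrast rollouts within one group by normalizing their rewards
    (assumed mean/std normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of rollouts for the same prompt: (answer correct?, fraction of tokens kept).
rollouts = [(True, 0.25), (True, 0.60), (False, 0.10), (True, 1.00)]
rewards = [joint_reward(c, t) for c, t in rollouts]
advantages = group_advantages(rewards)

for (correct, ratio), adv in zip(rollouts, advantages):
    print(f"correct={correct!s:5} kept={ratio:.2f} advantage={adv:+.3f}")
```

Under these assumptions, the correct rollout that keeps the fewest tokens receives the largest advantage, while the incorrect rollout receives the lowest one even though it uses the smallest budget.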

Paper Structure

This paper contains 15 sections, 14 equations, 9 figures, 8 tables, 2 algorithms.

Figures (9)

  • Figure 1: Overview of the proposed Sparsity Forcing. We use an MLLM with sparse attention as the policy model, e.g., Qwen2-VL+ZipVL, and the original model with standard causal attention as the reference model. The sampling group explores the minimum token ratio required to maintain the current answer under different attention score retention thresholds $p$.
  • Figure 2: Progressive top-$p$ sampling as a low-salience token test. As $p$ increases, additional tail tokens are included; correctness is then evaluated at each $p$ to identify the minimal budget that preserves accuracy.
  • Figure 3: Adjustment under low budgets across different models and benchmarks.
  • Figure 3: Comparisons with baseline methods of enhancing token sparsity on Qwen2.5VL-7b. $\dagger$ denotes post-training MLLMs with ZipVL.
  • Figure 4: (a) The effect of the attention score retention threshold $p$ on token ratio and performance. (b) Accuracy and token budget with respect to increasing token sequence length. (c)(d) Prefill latency and decoding memory usage under varying sequence lengths on LLaVA-Video-7b.
  • ...and 4 more figures
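
The progressive top-$p$ test described in the captions for Figures 1 and 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `top_p_keep`, `minimal_budget`, and the `is_correct` callback are hypothetical names, and a single flat list of per-token attention scores stands in for whatever the sparse attention method actually scores per layer.

```python
def top_p_keep(scores, p):
    """Return indices of the smallest set of highest-scoring tokens whose
    cumulative attention score reaches a fraction p of the total."""
    total = sum(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += scores[i]
        if cumulative >= p * total:
            break
    return sorted(kept)


def minimal_budget(scores, thresholds, is_correct):
    """Try thresholds in increasing order and return (p, token_ratio) for the
    first one whose kept tokens still yield a correct answer, else None.

    is_correct is a hypothetical callback standing in for running the model
    on the kept tokens and checking the answer."""
    for p in sorted(thresholds):
        kept = top_p_keep(scores, p)
        if is_correct(kept):
            return p, len(kept) / len(scores)
    return None


# Toy attention scores for five tokens (unnormalized).
scores = [10, 50, 4, 30, 6]

# Toy correctness check: assume the answer depends on token 0 being kept.
result = minimal_budget(scores, [0.6, 0.85, 0.95], lambda kept: 0 in kept)
print(result)
```

In this toy setup, the lowest threshold drops token 0 and fails the check, so the search moves to the next threshold, which mirrors how raising $p$ admits additional tail tokens until correctness is preserved.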