Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

Zeliang Zhang; Phu Pham; Wentian Zhao; Kun Wan; Yu-Jhe Li; Jianing Zhou; Daniel Miranda; Ajinkya Kale; Chenliang Xu

TL;DR

This work tackles the high computational cost of processing dense visual tokens in multimodal LLMs by exposing redundancy in visual computation. It shifts the focus from pruning tokens to pruning computation patterns, using four strategies: neighbor-aware visual attention, inactive visual-head pruning, sparse FFN projection, and lazy layer dropping, combined in the You Only Need to Prune Once (YOPO) approach. Applied to LLaVA, the method reduces visual computation by up to 88% with minimal performance loss, and the same redundancy is confirmed in other MLLMs (Qwen2-VL-7B, InternVL-2.0). The results offer a scalable, training-free path to deploying dense visual-token MLLMs in real-world settings; the authors state that code and checkpoints will be released to support adoption and further research.

Abstract

By treating visual tokens from visual encoders as text tokens, Multimodal Large Language Models (MLLMs) have achieved remarkable progress across diverse visual understanding tasks, leveraging the robust architectures of Large Language Models (LLMs). However, as token counts grow, the quadratic scaling of computation in LLMs introduces a significant efficiency bottleneck, impeding further scalability. Although recent approaches have explored pruning visual tokens or employing lighter LLM architectures, the computational overhead from an increasing number of visual tokens remains a substantial challenge. In this study, we investigate the redundancy in visual computation at both the parameter and computational pattern levels within LLaVA, a representative MLLM, and introduce a suite of streamlined strategies to enhance efficiency. These include neighbor-aware visual token attention, pruning of inactive visual attention heads, and selective layer dropping for visual computations. By implementing these strategies in LLaVA, we achieve a reduction in computational demands of 88% while maintaining model performance across key benchmarks. Additionally, we validate the existence of visual computational redundancy in other MLLMs, such as Qwen2-VL-7B and InternVL-2.0-4B/8B/26B. These results present a novel pathway for MLLMs to handle dense visual tokens with minimal computational costs. Code and model checkpoints will be released to support further research.
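To make the neighbor-aware attention idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the square Chebyshev-distance window, the `radius` parameter, and the 24 x 24 grid are all hypothetical choices, and causal masking and text tokens are left out. The point it shows is that the number of attended token pairs grows linearly with the token count when each visual token only looks at a fixed-size neighborhood.

```python
import numpy as np


def neighbor_mask(grid_h, grid_w, radius):
    """Boolean (N, N) mask over N = grid_h * grid_w visual tokens.

    Entry [i, j] is True when token j lies within `radius` grid cells
    (Chebyshev distance) of token i. The window shape and `radius` are
    assumptions made for this sketch, not values from the paper.
    """
    rows, cols = np.divmod(np.arange(grid_h * grid_w), grid_w)
    d_row = np.abs(rows[:, None] - rows[None, :])
    d_col = np.abs(cols[:, None] - cols[None, :])
    return np.maximum(d_row, d_col) <= radius


def masked_attention(q, k, v, mask):
    """Single-head softmax attention restricted to the pairs allowed by `mask`.

    This dense version still builds the full (N, N) score matrix, so it only
    illustrates the attention pattern; a real kernel would skip masked pairs.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


# Illustrative setup: a 24 x 24 grid of visual tokens with a 3 x 3 window.
mask = neighbor_mask(24, 24, radius=1)
n_tokens = mask.shape[0]
dense_pairs = n_tokens * n_tokens      # pairs scored by full attention
neighbor_pairs = int(mask.sum())       # pairs scored with the local window

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n_tokens, 16)) for _ in range(3))
out = masked_attention(q, k, v, mask)  # shape (n_tokens, 16)
```

Each token attends to at most `(2 * radius + 1) ** 2` neighbors, so `neighbor_pairs` is bounded by a constant times `n_tokens`, whereas `dense_pairs` grows with its square.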

Paper Structure (21 sections, 7 equations, 13 figures, 9 tables)

Introduction
Related work
Efficient MLLMs
LLM pruning
Method
Redundancy of visual computation in MLLMs
Neighbor tokens matter for visual attention
Sparse Visual Projection in FFN
Not all visual attention heads are equal
Your model only needs text for deeper layers
Evaluations
Experiment setup
Comparison with efficient LLaVA models
Pruning with different granularity
Computational redundancy beyond LLaVA
...and 6 more sections

Figures (13)

Figure 1: Compared to text tokens processed by language prompts, visual tokens are significantly more numerous, leading to substantial computational overhead in MLLMs. However, visual information is often sparser, resulting in considerable redundancy within visual computations. In this work, we propose pruning these redundant computations at both the parameter and computational pattern levels to improve processing efficiency.
Figure 2: Visualization of attention weights for randomly selected vision tokens interacting with other visual tokens at varying spatial distances across different layers in LLaVA. Notably, the attention weights are predominantly concentrated on neighboring visual tokens.
Figure 3: Visualization of the cross-modal attention weights between vision and text across varying layers. Each line represents the mean attention of an individual text token directed toward all other vision tokens, illustrating how attention varies as layers progress.
Figure 4: Overview of our method. Our approach replaces the traditional attention block with a neighbor-aware visual attention mechanism, reducing computational complexity from quadratic to linear with respect to the number of visual tokens. Additionally, we prune inactive attention heads in the visual computation, focusing only on the most impactful components. To further decrease computation overhead, we disable visual processing in the later layers, where visual information has minimal impact on the task.
Figure 5: Visualization of $\rho$ of different attention heads in different layers of LLaVA.
...and 8 more figures
