ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun

TL;DR

This work tackles the high computational cost of Multimodal LLMs (MLLMs), much of which comes from the large number of visual tokens. It introduces Layer Contribution (LC), a logit-based metric that quantifies how much each layer contributes to processing a given set of tokens, and uses it to reveal strong layer-wise redundancy for visual tokens. Building on LC, ShortV freezes visual-token updates in the layers with the lowest LC, replacing dense layers with sparse ShortV layers in a training-free manner. Roughly 60% of the layers can be replaced this way, giving up to a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining competitive performance, and the method is compatible with token-pruning methods such as FastV. Empirically, ShortV delivers comparable or superior results across several vision-language benchmarks, shows favorable efficiency–performance trade-offs, and provides a practical path to faster MLLMs without additional training or complex architecture changes. The approach highlights how differently MLLMs process visual and text tokens and offers a modular, hardware-friendly route to deploying capable MLLMs at scale.

Abstract

Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV
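
To make the LC computation described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the "divergence in model output" is a KL divergence between the next-token distributions of the original model and of the model with one layer's updates removed for the chosen tokens, averaged over samples; the abstract only says "divergence", so both the choice of KL and its direction are assumptions. The function names (`layer_contribution`, `select_layers_to_replace`) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p_logits, q_logits):
    """KL(p || q) for the distributions defined by two logit vectors.

    Assumption: KL is used as the divergence; the source does not name it.
    """
    p = softmax(p_logits)
    q = softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def layer_contribution(original_logits, frozen_logits):
    """Average divergence over samples for one layer.

    original_logits: logit vectors from the unmodified model.
    frozen_logits:   logit vectors from the model in which the layer's
                     updates to the chosen tokens were removed.
    """
    pairs = list(zip(original_logits, frozen_logits))
    return sum(kl_divergence(o, f) for o, f in pairs) / len(pairs)


def select_layers_to_replace(lc_scores, num_layers):
    """Indices of the `num_layers` layers with the lowest LC scores."""
    ranked = sorted(range(len(lc_scores)), key=lambda i: lc_scores[i])
    return sorted(ranked[:num_layers])
```

Under these assumptions, a layer whose removal leaves the logits unchanged gets an LC of zero, and the layers returned by `select_layers_to_replace` are the candidates for freezing visual tokens.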

Paper Structure

This paper contains 28 sections, 5 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: (a) Illustration of ShortV. We identify ineffective layers for visual tokens and replace these layers with sparse ShortV layers. In ShortV layers, we freeze visual tokens, and eliminate computations related to updating them. ShortV improves MLLM efficiency in a training-free manner and involves no parameter updates. Notably, ShortV is compatible with token pruning methods, e.g. FastV. (b) Performance vs. the number of ShortV layers. Average Performance means a normalized average score on multiple benchmarks. ShortV can freeze visual tokens in approximately 60% of the MLLM layers with nearly no performance degradation.
  • Figure 2: Sparse layers used to investigate layer redundancy for different tokens. To investigate layer redundancy for certain tokens, we freeze these tokens within the layer, i.e. keep hidden states of these tokens unchanged, and measure the divergence between the model's output logits and those of the original model. We gray out the attention that does not need calculation.
  • Figure 3: The Layer Contribution (LC) scores of LLaVA-1.5-7B and LLaVA-1.5-13B. A lower LC score implies that the layer's transformations on the specified tokens are less effective. Layers are less effective for visual tokens than for text tokens, and freezing visual tokens in ineffective layers results in minimal output divergence from the original model.
  • Figure 4: Details of the ShortV layer. In this layer, only text tokens pass through the $W_Q$ and $W_O$ matrices and the FFN. The attention mask is the same as that of the sparse layer for visual tokens in Figure 2, where visual tokens do not attend to other tokens, and only text tokens function as queries.
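
The captions for Figures 2 and 4 can be read as the following single-layer sketch. It is an illustration under simplifying assumptions, not the released ShortV code: it uses one attention head, an unweighted RMS normalization, and a plain ReLU feed-forward block in place of whatever the underlying LLM actually uses. The function `shortv_layer` and all parameter names are hypothetical. What it preserves from the captions is that visual hidden states are returned unchanged, that all tokens still supply keys and values, and that only text tokens act as queries and pass through `w_q`, `w_o`, and the FFN.

```python
import numpy as np


def rms_norm(x, eps=1e-6):
    """RMS normalization without learned scale (a simplification)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def shortv_layer(hidden, is_visual, w_q, w_k, w_v, w_o, w_up, w_down):
    """One sparse layer: text tokens are updated, visual tokens are frozen.

    hidden:    (n_tokens, d) hidden states entering the layer.
    is_visual: (n_tokens,) boolean mask marking visual tokens.
    """
    text_idx = np.flatnonzero(~is_visual)
    normed = rms_norm(hidden)

    # Keys and values are computed for every token, because text queries
    # still attend to visual tokens.
    k = normed @ w_k
    v = normed @ w_v
    # Queries are computed for text tokens only.
    q = normed[text_idx] @ w_q

    scores = q @ k.T / np.sqrt(k.shape[-1])  # (n_text, n_tokens)
    # Causal mask: a text token at position i sees positions j <= i.
    positions = np.arange(hidden.shape[0])
    allowed = positions[None, :] <= text_idx[:, None]
    scores = np.where(allowed, scores, -np.inf)

    # Row-wise softmax; every row has at least one finite entry (itself).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Output projection and FFN are applied to text tokens only.
    text_h = hidden[text_idx] + (weights @ v) @ w_o
    text_h = text_h + np.maximum(rms_norm(text_h) @ w_up, 0.0) @ w_down

    out = hidden.copy()       # visual rows stay exactly as they came in
    out[text_idx] = text_h
    return out
```

In this sketch the cost of the query projection, output projection, and FFN scales with the number of text tokens rather than the full sequence length, which is where the savings would come from when visual tokens dominate the sequence.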