ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

Qianhao Yuan; Qingyu Zhang; Yanjiang Liu; Jiawei Chen; Yaojie Lu; Hongyu Lin; Jia Zheng; Xianpei Han; Le Sun

TL;DR

This work tackles the high computational cost that the large number of visual tokens imposes on Multimodal LLMs (MLLMs). It introduces Layer Contribution (LC), a logit-based metric that quantifies how much each layer contributes to processing a given set of tokens, and uses it to show strong layer-wise redundancy for visual tokens. Building on LC, ShortV freezes visual-token updates in the layers with the lowest LC, replacing dense layers with sparse ShortV layers without any training. It replaces about 60% of the layers and achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining performance, and it is compatible with token-pruning methods such as FastV. Across several vision-language benchmarks, ShortV delivers comparable or superior results with a favorable efficiency–performance trade-off, and it requires neither additional training nor complex architecture changes. The approach highlights that visual and text tokens are processed differently across layers and offers a modular route to deploying MLLMs at lower cost.
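The idea of a sparse layer that freezes visual tokens can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: `toy_update` and `apply_layer` are hypothetical names, and the per-token update stands in for a real attention plus feed-forward block (which would also mix information across tokens).

```python
from typing import List, Sequence

Vector = List[float]


def toy_update(h: Vector) -> Vector:
    """Hypothetical stand-in for a layer's transformation of one token."""
    return [0.5 * x + 1.0 for x in h]


def apply_layer(hidden: List[Vector], is_visual: Sequence[bool],
                freeze_visual: bool) -> List[Vector]:
    """Apply a residual update; optionally leave visual tokens untouched."""
    out = []
    for h, vis in zip(hidden, is_visual):
        if freeze_visual and vis:
            # Frozen: the hidden state passes through, no update is computed.
            out.append(list(h))
        else:
            delta = toy_update(h)
            out.append([x + d for x, d in zip(h, delta)])
    return out


# Two visual tokens followed by one text token (assumed toy input).
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_visual = [True, True, False]

dense_out = apply_layer(hidden, is_visual, freeze_visual=False)
sparse_out = apply_layer(hidden, is_visual, freeze_visual=True)
```

In the sparse case only the text token is updated, so the cost of the layer scales with the number of text tokens rather than with the (usually much larger) number of visual tokens.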

Abstract

Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV.
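The LC computation and layer selection described above can be sketched on a toy model. This is an illustrative sketch under stated assumptions, not the released code: the abstract only says "divergence in model output", so KL divergence is an assumption here, and `forward`, `layer_contribution`, and the linear toy layers are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    # Assumed divergence measure; the source does not name one.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def forward(hidden, is_visual, scales, frozen_layers=()):
    """Toy model: each layer adds scale * (mean over tokens) to every token.
    Visual tokens are left unchanged in layers listed in frozen_layers."""
    h = [list(tok) for tok in hidden]
    for idx, scale in enumerate(scales):
        dim = len(h[0])
        context = [sum(tok[d] for tok in h) / len(h) for d in range(dim)]
        h = [tok if (vis and idx in frozen_layers)
             else [x + scale * c for x, c in zip(tok, context)]
             for tok, vis in zip(h, is_visual)]
    return h[-1]  # logits are read from the last (text) token


def layer_contribution(hidden, is_visual, scales, layer):
    """Divergence between the full output and the output obtained when
    `layer` no longer transforms the visual tokens."""
    full = softmax(forward(hidden, is_visual, scales))
    ablated = softmax(forward(hidden, is_visual, scales, frozen_layers={layer}))
    return kl_divergence(full, ablated)


hidden = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
is_visual = [True, True, False]
scales = [0.8, 0.0, 0.5, 0.0, 0.3]  # hypothetical per-layer strengths

scores = [layer_contribution(hidden, is_visual, scales, i)
          for i in range(len(scales))]
num_to_freeze = 3  # roughly 60% of the 5 toy layers
selected = sorted(range(len(scales)), key=lambda i: scores[i])[:num_to_freeze]
```

Layers whose visual-token update does not change the output receive an LC of zero and are selected first; in a real MLLM the scores would be estimated on a small calibration set rather than a single toy input.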
