FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models

Tianyu Fu; Tengxuan Liu; Qinghao Han; Guohao Dai; Shengen Yan; Huazhong Yang; Xuefei Ning; Yu Wang

TL;DR

FrameFusion tackles the token explosion in large vision-language models processing long videos by merging highly similar visual tokens across adjacent frames before pruning by importance. The approach is grounded in a thorough analysis showing that similar tokens are most common among corresponding tokens in neighboring frames and that similarity rankings are stable across layers, justifying shallow-layer merging with cascaded reductions. Across six LVLMs and five video benchmarks, FrameFusion reduces tokens by about 70% while maintaining average performance losses under 3%, delivering 1.6–3.6x end-to-end speedups and notable KV-Cache memory savings. The method is simple, broadly applicable, and validated through extensive ablations, efficiency analyses, and scalability experiments.

Abstract

The increasing demand to process long and high-resolution videos significantly burdens Large Vision-Language Models (LVLMs) due to the enormous number of visual tokens. Existing token reduction methods primarily prune tokens based on importance metrics, such as cumulative attention scores. However, even important tokens may exhibit high redundancy caused by similarity among adjacent video frames and repetitive visual elements. To address this limitation, we propose FrameFusion, a novel token reduction approach integrating similarity-based merging with importance-based pruning. We conduct a thorough study on token similarity characteristics, revealing three key insights: (1) spatially corresponding visual tokens between adjacent frames have higher cosine similarities compared to other token pairs; (2) high token similarities prominently decrease in deeper model layers; and (3) token similarity rankings are highly consistent across different layers. Guided by these observations, FrameFusion computes token similarities exclusively between corresponding visual tokens from adjacent frames, applies token merging at initial successive layers followed by pruning in deeper layers, and adopts a cascaded merging strategy to further enhance efficiency. We evaluate FrameFusion comprehensively across six diverse LVLMs, ranging from 2B to 72B parameters, using five video benchmarks encompassing video retrieval, question-answering, and spatial-temporal understanding tasks. Experiments show that FrameFusion reduces visual tokens by 70%, achieving 1.6-3.6x end-to-end speedups, with an average performance impact of less than 3%. Our code is available at: https://github.com/thu-nics/FrameFusion.
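The two stages described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). The function names, the `threshold` parameter, the averaging of merged tokens, and the externally supplied `scores` array are all assumptions made for illustration. The sketch compares each visual token only with the spatially corresponding token in the previous frame, merges runs of similar tokens into their mean, and then keeps the highest-scoring tokens.

```python
import numpy as np


def merge_adjacent_frames(tokens, threshold):
    """Merge similar, spatially corresponding tokens of adjacent frames.

    tokens: array of shape (frames, positions, dim).
    threshold: hypothetical cosine-similarity cutoff (an assumption).
    Returns an array of shape (num_kept, dim), ordered position by position.
    """
    num_frames, num_positions, _ = tokens.shape
    # Normalize so that a dot product equals cosine similarity.
    norms = np.linalg.norm(tokens, axis=-1, keepdims=True)
    unit = tokens / np.maximum(norms, 1e-12)
    # sim[f - 1, p] compares frame f with frame f - 1 at position p.
    sim = np.sum(unit[1:] * unit[:-1], axis=-1)

    kept = []
    for p in range(num_positions):
        group = [tokens[0, p]]
        for f in range(1, num_frames):
            if sim[f - 1, p] > threshold:
                # Similar to its predecessor: join the current group.
                group.append(tokens[f, p])
            else:
                # Dissimilar: close the group (averaging is an assumption).
                kept.append(np.mean(group, axis=0))
                group = [tokens[f, p]]
        kept.append(np.mean(group, axis=0))
    return np.stack(kept)


def prune_by_importance(tokens, scores, keep):
    """Keep the `keep` tokens with the highest importance scores.

    scores: one value per token, e.g. cumulative attention scores
    (supplied by the caller here; computing them is out of scope).
    """
    top = np.argsort(scores)[::-1][:keep]
    # Restore the original token order among the survivors.
    return tokens[np.sort(top)]
```

In this sketch, merging runs first and pruning is applied to the merged sequence, mirroring the merge-then-prune order in the abstract. The cascaded, multi-layer merging the paper describes is not modeled here.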

Paper Structure (48 sections, 6 equations, 29 figures, 11 tables)

Introduction
Related Work
  Large Vision Language Model (LVLMs)
  Token Compression
Token Similarity Analysis
  Experimental Setup and Definitions
  Where Does High Similarity Occur?
  What Is the Token Similarity Distribution Across Layers?
  Is Token Similarity Ranking Consistent Across Layers?
FrameFusion Design
  Two-Stage Token Compression
  Design Choice Rationales
Experiment
  Setups
  Computation-Accuracy Trade-off
...and 33 more sections

Figures (29)

Figure 1: The central idea of FrameFusion. Compared with importance-based token pruning, FrameFusion additionally applies similarity-based token merging, keeping only important and unique visual tokens.
Figure 2: Token similarities among all input tokens at the first LVLM layer of the Llava-Video-7B model. For visual clarity, the color bar displays only the top 90% of similarity values. Visual tokens begin at index 14, with 210 tokens per frame.
Figure 3: Heatmap of token similarity across model layers. Each cell represents a similarity range at a specific layer, with color intensity denoting distribution frequency. The line overlay shows the mean token similarity per layer.
Figure 4: Spearman Rank Correlation (SRC) between adjacent layers for the Llava-Video-7B model.
Figure 5: The top-30% retention rate across model layers using different retention metrics and starting layers.
...and 24 more figures
