LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou
TL;DR
LLaVA-Scissor introduces Semantic Connected Components (SCC) to identify non-overlapping semantic regions in video tokens and applies a two-step spatio-temporal compression to represent an entire video with a compact set of tokens. SCC computes pairwise token similarities, builds an adjacency graph, finds approximate connected components via Union-Find, and merges the tokens within each semantic region. Spatial SCC is followed by temporal SCC, with a final ToMe-like merging step to align source and retained tokens. Across video QA, long-video understanding, and MVBench, LLaVA-Scissor outperforms existing training-free token reduction methods, particularly at low token budgets, while significantly reducing FLOPs prior to LLM processing. The approach reveals substantial token redundancy in video MLLMs and demonstrates robust semantic preservation even under aggressive compression, enabling more efficient video-language understanding in resource-constrained settings.
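The SCC pipeline summarized above (similarity, adjacency graph, Union-Find, per-region merging, then spatial followed by temporal compression) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `scc_compress` and `two_step_compress`, the threshold parameter `tau`, the use of cosine similarity, and mean-pooling within each component are all assumptions made for illustration. The sketch also computes exact connected components, whereas the summary mentions an approximate solution, and it omits the final ToMe-like merging step.

```python
import numpy as np

def find(parent, i):
    # Union-Find lookup with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def scc_compress(tokens, tau=0.8):
    """Merge tokens that fall in the same connected component.

    tokens: (N, D) array. tau: similarity threshold (hypothetical name).
    Returns (merged, labels): merged is (K, D), one mean token per
    component; labels maps each input token to its component index.
    """
    n, d = tokens.shape
    # Cosine similarity between all token pairs (assumed similarity measure).
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T
    # Adjacency graph: connect tokens whose similarity exceeds tau.
    rows, cols = np.nonzero(np.triu(sim > tau, k=1))
    parent = list(range(n))
    for i, j in zip(rows.tolist(), cols.tolist()):
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)
    roots = np.array([find(parent, i) for i in range(n)])
    uniq, labels = np.unique(roots, return_inverse=True)
    # Represent each component by the mean of its member tokens.
    sums = np.zeros((len(uniq), d), dtype=np.float64)
    np.add.at(sums, labels, tokens)
    counts = np.bincount(labels, minlength=len(uniq))
    return sums / counts[:, None], labels

def two_step_compress(video, tau=0.8):
    """video: (T, N, D). Spatial SCC per frame, then temporal SCC
    over the pooled per-frame representatives."""
    per_frame = [scc_compress(frame, tau)[0] for frame in video]
    pooled = np.concatenate(per_frame, axis=0)
    return scc_compress(pooled, tau)[0]
```

In this sketch the number of retained tokens is not set directly; it follows from `tau`, with a higher threshold producing more, smaller components and therefore more retained tokens.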
Abstract
In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but fail to effectively capture all semantic regions and often lead to token redundancy. In contrast, we propose to leverage the Semantic Connected Components (SCC) approach, which assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. The outcome is a two-step spatio-temporal token compression strategy that utilizes SCC in both the spatial and temporal domains. This strategy can effectively compress tokens by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multiple-choice benchmarks. Experimental results show that the proposed LLaVA-Scissor outperforms other token compression methods across these benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor.
