LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou

TL;DR

LLaVA-Scissor introduces Semantic Connected Components (SCC) to identify non-overlapping semantic regions in video tokens and applies a two-step spatio-temporal compression to represent an entire video with a compact set of tokens. SCC computes pairwise token similarities, builds an adjacency graph, and finds approximate connected components via Union-Find, then merges the tokens within each semantic region. Spatial SCC is followed by temporal SCC, with a final ToMe-like merging step to align source and retained tokens. Across video QA, long-video understanding, and MVBench, LLaVA-Scissor outperforms existing training-free token reduction methods, particularly at low token budgets, while significantly reducing FLOPs prior to LLM processing. The approach reveals substantial token redundancy in video MLLMs and demonstrates robust semantic preservation even under aggressive compression, enabling more efficient video-language understanding in resource-constrained settings.
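The SCC step described above (similarities, adjacency graph, Union-Find, merge) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `scc_compress`, the use of cosine similarity, the default threshold `tau=0.8`, and averaging as the merge rule are all assumptions made for illustration, and the sketch computes exact connected components rather than the approximate variant the paper refers to.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def scc_compress(tokens, tau=0.8):
    """Hypothetical helper: merge tokens that fall in the same connected component.

    Two tokens are adjacent when their cosine similarity exceeds tau (assumed rule).
    Returns (merged_tokens, groups), where groups lists the member indices.
    """
    n = len(tokens)
    parent = list(range(n))  # Union-Find forest: every token starts as its own root

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as root so group order is deterministic.
            parent[max(ri, rj)] = min(ri, rj)

    # Build the adjacency relation implicitly: union every sufficiently similar pair.
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(tokens[i], tokens[j]) > tau:
                union(i, j)

    # Collect the members of each component.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    # Merge each component into one token by averaging (assumed merge rule).
    merged = []
    for members in groups.values():
        dim = len(tokens[members[0]])
        merged.append(
            [sum(tokens[m][d] for m in members) / len(members) for d in range(dim)]
        )
    return merged, list(groups.values())
```

For example, calling `scc_compress([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])` groups the first two and the last two tokens, so four tokens reduce to two. The all-pairs loop is quadratic in the number of tokens, which is acceptable for a sketch but is the kind of cost an approximate solver would aim to reduce.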

Abstract

In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but fail to effectively capture all semantic regions and often lead to token redundancy. In contrast, we propose to leverage Semantic Connected Components (SCC), an approach that assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. The outcome is a two-step spatio-temporal token compression strategy that utilizes SCC in both spatial and temporal domains. This strategy can effectively compress tokens by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multiple-choice benchmarks. Experimental results show that the proposed LLaVA-Scissor outperforms other token compression methods, achieving superior performance in various video understanding benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor.
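The two-step spatio-temporal strategy can be sketched in the same spirit: compress each frame on its own, compress the per-frame results across time, then fold every source token into its closest retained token. The sketch below is a hedged illustration only. The names `compress_video`, `_components`, `_cos`, and `_mean`, the single shared threshold `tau`, averaging as the merge rule, and the nearest-token assignment used for the final merging step are assumptions, not details taken from the paper.

```python
import math


def _cos(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def _mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def _components(tokens, tau):
    """Index groups forming connected components of the graph 'similarity > tau'."""
    parent = list(range(len(tokens)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if _cos(tokens[i], tokens[j]) > tau:
                parent[find(j)] = find(i)  # attach j's root under i's root

    groups = {}
    for i in range(len(tokens)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def compress_video(frames, tau=0.8):
    """Hypothetical two-step compression over frames (a list of token lists)."""
    # Step 1 (spatial): one representative token per component inside each frame.
    spatial = []
    for frame in frames:
        for members in _components(frame, tau):
            spatial.append(_mean([frame[m] for m in members]))

    # Step 2 (temporal): components across the representatives of all frames.
    retained = [
        _mean([spatial[m] for m in members])
        for members in _components(spatial, tau)
    ]

    # Final merging (assumed rule): assign each source token to its most similar
    # retained token, then average every bucket.
    source = [tok for frame in frames for tok in frame]
    buckets = [[] for _ in retained]
    for tok in source:
        best = max(range(len(retained)), key=lambda k: _cos(tok, retained[k]))
        buckets[best].append(tok)
    return [_mean(b) if b else r for b, r in zip(buckets, retained)]
```

With two toy frames of three 2-D tokens each, where both frames contain the same two directions, the six source tokens collapse to two retained tokens: the spatial step removes the repetition within a frame and the temporal step removes the repetition across frames.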

Paper Structure

This paper contains 27 sections, 11 equations, 4 figures, 7 tables, 2 algorithms.

Figures (4)

  • Figure 1: Illustration of different token compression paradigms. $\square$ denotes video tokens, with color representing different semantics. (a) Attention-based methods fail to cover all semantic regions. (b) Segment-based methods introduce temporal redundancy by stacking tokens from each segment. (c) Our two-step spatio-temporal compression strategy is able to identify unique semantic information within each frame and eliminate temporal redundancy, resulting in non-overlapping video tokens.
  • Figure 2: Pipeline of LLaVA-Scissor. (a) The Semantic Connected Components (SCC) compress tokens by extracting connected components from the token set. (b) The two-step spatio-temporal compression strategy that extracts unique semantics by leveraging SCC both spatially and temporally.
  • Figure 3: Token number statistics of similarity threshold $\tau$ and error tolerance $\epsilon$.
  • Figure 4: Performance degradation of methods on different benchmarks as the retained token number decreases. 'RR' denotes the token retention ratio.