UniComp: Rethinking Video Compression Through Informational Uniqueness

Chao Yuan, Shimin Chen, Minliang Lin, Limeng Qiao, Guanglu Wan, Lin Ma

TL;DR

UniComp reframes video compression around information uniqueness rather than attention, formulating the problem as minimizing the conditional entropy H(X|S) and deriving a reconstruction-error bound linked to token uniqueness. The framework integrates Frame Group Fusion, Token Allocation, and Spatial Dynamic Compression to adaptively reduce temporal, global, and spatial redundancy while preserving semantically unique content. The theoretical bounds and extensive long-video experiments indicate that UniComp outperforms state-of-the-art methods, remains robust across backbones and frame-length scales, and improves efficiency. This approach offers a practical, plug-and-play solution for scalable multimodal video understanding on long sequences.
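
The objective named above can be written compactly. The following is a sketch under assumed notation, not necessarily the paper's own: $X$ is the full set of visual tokens, $S$ is the retained subset, and $K$ is a hypothetical token budget.

$$
S^{*} = \arg\min_{S \subseteq X,\; |S| \le K} H(X \mid S)
$$

Intuitively, a small $H(X \mid S)$ means the discarded tokens are largely predictable from the retained ones, so little information is lost by dropping them.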

Abstract

Distinct from attention-based compression methods, this paper presents an information-uniqueness-driven video compression framework, termed UniComp, which aims to maximize the information fidelity of video representations under constrained computational budgets. Starting from an information-theoretic perspective, we formulate vision compression as an optimization problem that minimizes the conditional entropy (reconstruction error) between retained and full tokens. To achieve this, we introduce the notion of information uniqueness, which measures intrinsic redundancy among tokens and links it to reconstruction error. Based on uniqueness, we design three modules (Frame Group Fusion, Token Allocation, and Spatial Dynamic Compression) that progressively perform semantic frame grouping, adaptive resource allocation, and fine-grained spatial compression. Extensive experiments demonstrate that UniComp consistently outperforms existing compression methods in preserving essential visual tokens under limited computational budgets, highlighting the pivotal role of information uniqueness in token compression efficacy.
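
To make the idea of selecting tokens by uniqueness more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's algorithm: uniqueness is approximated by low cosine similarity to the tokens already retained, selection is greedy, and each dropped token is averaged into its most similar retained token. The function names `select_unique_tokens` and `fuse_into_retained` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_unique_tokens(tokens, k):
    """Greedily pick k token indices, each as dissimilar as possible
    from the tokens already retained (an assumed proxy for uniqueness)."""
    n = len(tokens)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of tokens")
    sim = [[cosine(a, b) for b in tokens] for a in tokens]
    # Seed with the token least similar, on average, to all the others.
    first = min(range(n), key=lambda i: sum(sim[i]) - sim[i][i])
    selected = [first]
    # closest[i]: highest similarity between token i and any retained token.
    closest = list(sim[first])
    while len(selected) < k:
        candidates = [i for i in range(n) if i not in selected]
        nxt = min(candidates, key=lambda i: closest[i])
        selected.append(nxt)
        closest = [max(c, s) for c, s in zip(closest, sim[nxt])]
    return selected


def fuse_into_retained(tokens, selected):
    """Average every dropped token into its most similar retained token.
    Returns one fused vector per retained index, in selection order."""
    groups = {j: [tokens[j]] for j in selected}
    for i, tok in enumerate(tokens):
        if i in groups:
            continue
        owner = max(selected, key=lambda j: cosine(tok, tokens[j]))
        groups[owner].append(tok)
    return [[sum(col) / len(groups[j]) for col in zip(*groups[j])]
            for j in selected]


if __name__ == "__main__":
    # Toy 2-D "tokens": two near-duplicate pairs and one outlier.
    toy = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0], [-1.0, 0.0]]
    kept = select_unique_tokens(toy, 3)
    print(kept, fuse_into_retained(toy, kept))
```

On the toy input, the greedy rule keeps one representative from each near-duplicate pair plus the outlier, which is the behavior a uniqueness criterion is meant to encourage; an attention-style score has no such built-in diversity pressure.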

Paper Structure

This paper contains 46 sections, 21 equations, 14 figures, 8 tables, 1 algorithm.

Figures (14)

  • Figure 1: Left: Comparison of UniComp with state-of-the-art methods (VisionZip yang2025visionzip and HoliTom shao2025holitom) on the Eagle2.5 chen2025eagle model under three retained-ratio settings. The input is 32 frames with 256 tokens per frame. Words in green are correct and words in red are wrong. Surprisingly, even when retaining only 5% of the tokens, UniComp still recognizes the words "PEPPERMINT TEA" on the tea box, although its output contains some wrong words. Right: Performance compared with SOTA methods; UniComp can even surpass the uncompressed baseline.
  • Figure 2: Visualization on LLaVA-OneVision-7B li2024llava of the differences between attention-based token selection (as in VisionZip yang2025visionzip and HoliTom shao2025holitom) and our uniqueness-based token selection. The top-20 selected tokens are marked with red rectangles and labeled with their selection order. The baseline attention-based selection is redundant and misses key content, while ours captures essential information with diverse coverage.
  • Figure 3: Framework of UniComp. It has three modules: Frame Group Fusion (FGF), Token Allocation (TA), and Spatial Dynamic Compression (SDC). The right part shows the selection and fusion of retained tokens. Red rectangles mark retained tokens, labeled with their order; fusion is visualized for four tokens (1, 2, 4, and 13) using four colors (tokens of the same color are fused into the token marked with the red rectangle).
  • Figure 4: Left: comparison with SOTA compression methods under different retained ratios; UniComp outperforms all of them and even exceeds the full-token setting without compression. Right: comparison with SOTA compression methods under different frame counts (with the same token budget of 32 frames $\times$ 196 tokens).
  • Figure 5: Efficiency comparison with the vanilla method using the full token set. UniComp achieves up to 4× faster Time-To-First-Token (TTFT), demonstrating superior efficiency on long videos.
  • ...and 9 more figures