InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression

Haotian Ye, Qiyuan He, Jiaqi Han, Puheng Li, Jiaojiao Fan, Zekun Hao, Fitsum Reda, Yogesh Balaji, Huayu Chen, Sheng Liu, Angela Yao, James Zou, Stefano Ermon, Haoxiang Wang, Ming-Yu Liu

TL;DR

The paper tackles efficient long-video representation by introducing InfoTok, an information-theoretic adaptive video tokenizer. It replaces fixed-length tokenization with an ELBO-guided router that assigns token counts based on content complexity and a transformer-based adaptive compressor that preserves the most informative tokens. The authors prove that fixed-rate, data-agnostic tokenizers are suboptimal in representation length and report substantial gains: about 20% token savings with no loss in performance, and a 2.3x compression rate while still outperforming prior heuristic adaptive methods. The approach generalizes across resolutions and offers a principled pathway to scalable, multimodal video modeling.

Abstract

Accurate and efficient discrete video tokenization is essential for processing long video sequences. Yet the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy or information loss. Drawing inspiration from Shannon's information theory, this paper introduces InfoTok, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based adaptive compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression performance, saving 20% of tokens without affecting performance and achieving a 2.3x compression rate while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, InfoTok enables a more compressed yet accurate tokenization for video representation, offering valuable insights for future research.

Paper Structure

This paper contains 31 sections, 3 theorems, 26 equations, 5 figures, 7 tables, 1 algorithm.

Key Result

Theorem 2.1

For any tokenizer $\mathcal{T}$ with codebook size $C$ that can fully reconstruct video data ${\mathbf{x}} \sim p({\mathbf{x}})$ defined above, we have where $N_{\mathbf{x}}$ is the token sequence length of ${\mathbf{x}}$ assigned by $\mathcal{T}$. Additionally, there exists an adaptive tokenization that has
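
Theorem 2.1 restates the Shannon source coding theorem. As a reminder of the classical form of that result (a sketch in this section's notation, not the paper's exact statement): with entropy $H({\mathbf{x}})$ measured in bits, any uniquely decodable code over an alphabet of size $C$ satisfies $\mathbb{E}[N_{\mathbf{x}}] \ge H({\mathbf{x}}) / \log_2 C$, and assigning $\lceil -\log_C p({\mathbf{x}}) \rceil$ tokens to each ${\mathbf{x}}$ gives $\mathbb{E}[N_{\mathbf{x}}] < H({\mathbf{x}}) / \log_2 C + 1$. The Python sketch below checks both sides on a toy distribution. The four-item distribution, the codebook size, and every function name are illustrative assumptions and do not come from the paper.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def shannon_lengths(probs, codebook_size):
    """Classical Shannon code lengths: ceil(-log_C p) tokens per item."""
    bits_per_token = math.log2(codebook_size)
    # The small epsilon guards against floating-point round-up on exact integers.
    return [math.ceil(-math.log2(p) / bits_per_token - 1e-12) for p in probs]


def expected_length(probs, lengths):
    """Average token count under the distribution."""
    return sum(p * n for p, n in zip(probs, lengths))


def fixed_length(num_items, codebook_size):
    """Smallest N with C**N >= num_items, i.e. one fixed length for every item."""
    n = 0
    while codebook_size ** n < num_items:
        n += 1
    return n


# Hypothetical toy source: four "videos" with unequal probabilities.
probs = [0.5, 0.25, 0.125, 0.125]
C = 2  # hypothetical codebook size

lower_bound = entropy_bits(probs) / math.log2(C)
adaptive_lengths = shannon_lengths(probs, C)
adaptive_avg = expected_length(probs, adaptive_lengths)
fixed_avg = fixed_length(len(probs), C)

print("entropy lower bound:", lower_bound)
print("adaptive lengths:", adaptive_lengths, "average:", adaptive_avg)
print("fixed length:", fixed_avg)
```

In this toy case the adaptive lengths meet the entropy bound exactly, while the fixed-length code spends more tokens on average. This is the kind of gap between fixed-rate and content-dependent allocation that the theorem formalizes.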

Figures (5)

  • Figure 1: Overall framework of InfoTok, an information-theoretic adaptive video tokenization method. An encoder maps video ${\mathbf{x}}$ into fixed-length embeddings, from which a router estimates the number of tokens $N_{\mathbf{x}}$ based on information complexity. An adaptive compressor then reduces the embeddings to $N_{\mathbf{x}}$ tokens, which are quantized. For reconstruction, the tokens are decompressed back to fixed-length embeddings and decoded into video. This adaptive design conditions token length on video complexity: e.g., the stable dog video is compressed more (0.40) than the dynamic cat-fighting video (0.62).
  • Figure 2: Reconstruction examples for videos of different complexities using different tokenizers. InfoTok-Flex achieves similar PSNR at much higher compression (compared to Cosmos-DV), and better PSNR at similar compression rates (compared to ElasticTok).
  • Figure 3: Reconstruction examples from InfoTok-Flex at different compression rates.
  • Figure 4: Video tokenization performance of InfoTok-Flex, InfoTok, and ElasticTok on TokenBench (a-c) and DAVIS (d-f). Quality metrics are plotted against $\text{BPP}_{16}$ (bits per 16 pixels). Tokenization efficiency, measured as Number of Function Evaluations (NFE) overhead (additional NFEs / standard NFEs, $\downarrow$), is shown in (g). InfoTok-Flex and InfoTok achieve superior reconstruction quality at lower $\text{BPP}_{16}$ levels. Additionally, InfoTok is significantly more efficient than ElasticTok, which requires searching to meet thresholds.
  • Figure : Adaptive Tokenizer Training

Theorems & Definitions (5)

  • Theorem 2.1: Shannon Source Coding Theorem (restated) (Shannon, 1959)
  • Theorem 2.2
  • Theorem 3.1
  • proof
  • proof