InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression
Haotian Ye, Qiyuan He, Jiaqi Han, Puheng Li, Jiaojiao Fan, Zekun Hao, Fitsum Reda, Yogesh Balaji, Huayu Chen, Sheng Liu, Angela Yao, James Zou, Stefano Ermon, Haoxiang Wang, Ming-Yu Liu
TL;DR
The paper tackles efficient long-video representation by introducing InfoTok, an information-theoretic adaptive video tokenizer. It replaces fixed-length tokenization with an ELBO-guided router that assigns token counts based on content complexity and a transformer-based adaptive compressor that preserves the most informative tokens. The authors prove that fixed or data-agnostic tokenizers are suboptimal in representation length and report substantial gains: about 20% token savings with no loss in performance, and a 2.3x compression rate while still outperforming prior heuristic adaptive methods. The approach generalizes across resolutions and offers a principled pathway to scalable, multimodal video modeling.
Abstract
Accurate and efficient discrete video tokenization is essential for processing long video sequences. Yet the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy or information loss. Drawing inspiration from Shannon's information theory, this paper introduces InfoTok, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based adaptive compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression performance, saving 20% of tokens without affecting performance, and achieving a 2.3x compression rate while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, InfoTok enables a more compressed yet accurate tokenization for video representation, offering valuable insights for future research.
