CAT: Content-Adaptive Image Tokenization
Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, Chunting Zhou
TL;DR
CAT introduces a content-adaptive image tokenizer that allocates representation capacity based on image complexity inferred from caption-based descriptions. A caption-LLM scoring system guides a nested VAE to produce variable-shaped latent representations, enabling adaptive compression at $8\times$, $16\times$, and $32\times$ ratios. When paired with diffusion-based generation (DiT), CAT improves FID over fixed-ratio baselines trained with the same FLOPs, raises inference throughput, and enables controllable generation by varying token counts. The approach yields improved reconstruction on perceptually challenging images and accelerates learning for generative models, with potential extensions to video and multi-modal tasks.
Abstract
Most existing image tokenizers encode images into a fixed number of tokens or patches, overlooking the inherent variability in image complexity. To address this, we introduce Content-Adaptive Tokenizer (CAT), which dynamically adjusts representation capacity based on the image content and encodes simpler images into fewer tokens. We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image, taking into account factors critical to human perception. Trained on images with diverse compression ratios, CAT demonstrates robust performance in image reconstruction. We also use its variable-length latent representations to train Diffusion Transformers (DiTs) for ImageNet generation. By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same FLOPs and boosts the inference throughput by 18.5%.
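
To make the effect of the three compression ratios concrete, the following is a minimal sketch of the latent-grid arithmetic. It assumes that each ratio is a per-side spatial downsampling factor and that the input is a 256×256 image; neither assumption is stated in this section. The function names `latent_grid` and `latent_positions` are hypothetical and are not part of CAT's code.

```python
# Illustrative arithmetic only; this is not CAT's implementation.
# Assumption: a compression ratio r shrinks each spatial side by a factor of r.

RATIOS = (8, 16, 32)  # the three compression ratios named in the TL;DR


def latent_grid(height: int, width: int, ratio: int) -> tuple[int, int]:
    """Return the (rows, cols) of the latent grid for a given compression ratio."""
    if ratio not in RATIOS:
        raise ValueError(f"unsupported compression ratio: {ratio}")
    if height % ratio or width % ratio:
        raise ValueError("image sides must be divisible by the ratio")
    return height // ratio, width // ratio


def latent_positions(height: int, width: int, ratio: int) -> int:
    """Return the number of spatial positions in the latent grid."""
    rows, cols = latent_grid(height, width, ratio)
    return rows * cols


if __name__ == "__main__":
    # Assumed 256x256 input, chosen only for illustration.
    for ratio in RATIOS:
        grid = latent_grid(256, 256, ratio)
        count = latent_positions(256, 256, ratio)
        print(f"{ratio}x -> grid {grid}, {count} latent positions")
```

Under these assumptions, each doubling of the ratio cuts the number of latent positions by a factor of four, which is why assigning simpler images to higher ratios reduces the sequence length a downstream DiT has to process.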
