CAT: Content-Adaptive Image Tokenization

Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, Chunting Zhou

TL;DR

CAT is a content-adaptive image tokenizer that allocates representation capacity according to image complexity, which it infers from caption-based descriptions. An LLM scores the caption, and that score guides a nested VAE to produce variable-shaped latent representations at compression ratios of $8\times$, $16\times$, and $32\times$. When paired with diffusion-based generation (DiT), CAT improves FID over fixed-ratio baselines under the same compute budget, raises inference throughput, and enables controllable generation by varying the token count. The approach improves reconstruction on perceptually challenging images and accelerates learning for generative models, with potential extensions to video and multi-modal tasks.

Abstract

Most existing image tokenizers encode images into a fixed number of tokens or patches, overlooking the inherent variability in image complexity. To address this, we introduce Content-Adaptive Tokenizer (CAT), which dynamically adjusts representation capacity based on the image content and encodes simpler images into fewer tokens. We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image, taking into account factors critical to human perception. Trained on images with diverse compression ratios, CAT demonstrates robust performance in image reconstruction. We also utilize its variable-length latent representations to train Diffusion Transformers (DiTs) for ImageNet generation. By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same FLOPs and boosts the inference throughput by 18.5%.
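
To make the link between compression ratio and token count concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a square input image, a purely spatial compression ratio, and one token per latent position; the function name `latent_grid` and the 256-pixel input size are illustrative choices, not details taken from the paper.

```python
# Illustrative sketch only: assumes a square image and one token per latent position.
RATIOS = (8, 16, 32)  # spatial compression ratios named in the text


def latent_grid(image_size: int, ratio: int) -> tuple[int, int]:
    """Return (latent side length, token count) for a square image.

    `latent_grid` is a hypothetical helper, not part of any CAT codebase.
    """
    if image_size % ratio != 0:
        raise ValueError("image_size must be divisible by the compression ratio")
    side = image_size // ratio
    return side, side * side


if __name__ == "__main__":
    for r in RATIOS:
        side, tokens = latent_grid(256, r)
        print(f"{r}x -> {side}x{side} latent, {tokens} tokens")
```

Under these assumptions, each doubling of the compression ratio cuts the token count by a factor of four, which is why assigning simpler images a higher ratio can reduce the sequence length a downstream DiT has to process.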

Paper Structure (37 sections, 3 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Content-Adaptive Tokenization. CAT uses an LLM to evaluate the content complexity and determine the optimal compression ratio based on the image's text description. The image is processed by a nested VAE architecture that dynamically routes the input according to the selected compression ratio. The resulting latent representations thus have varying spatial dimensions. Images shown in the figure are taken from COCO 2014.
  • Figure 2: Left: Maximum acceptable compression ratios for COCO images under different error tolerances. Most images can be compressed more aggressively without compromising reconstruction quality. Right: Pearson correlation between various metrics and the maximum acceptable compression ratio at a tolerance of 0.0015.
  • Figure 3: Existing metrics can misjudge image complexity. Metrics like JPEG size, MSE, and LPIPS consider images with high contrast and repetitive patterns as complex but underestimate the complexity of text-heavy images that are more challenging for human perception (note the distortion in the bottom two rows). Images shown in the figure are taken from COCO 2014.
  • Figure 4: The compression ratio selected by our proposed caption complexity is highlighted in red. On simpler images (top two rows), adjusting the CAT compression ratio does not significantly affect quality. On more complex images (bottom three rows), the impact is substantial. Also note that CAT's text reconstruction is comparable to that of the fixed $8\times$ baseline and better than that of the pretrained LDM VAE. Images shown in the figure are taken from COCO 2014 and ChartQA.
  • Figure 5: Increasing token count (left$\rightarrow$right) for CAT leads to better image quality and higher complexity.
  • ...and 3 more figures
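
The "maximum acceptable compression ratio" in the Figure 2 caption can be read as the largest ratio whose reconstruction error stays within a tolerance. The sketch below illustrates that reading under stated assumptions: the error metric is not specified in this excerpt, the error values are made up for illustration, and the fallback to the smallest ratio when no ratio qualifies is a choice made here, not something the source states. The helper name `max_acceptable_ratio` is hypothetical.

```python
def max_acceptable_ratio(errors: dict[int, float], tolerance: float) -> int:
    """Return the largest compression ratio whose error is within tolerance.

    `errors` maps a compression ratio to a reconstruction error for one image.
    Assumption: if no ratio meets the tolerance, fall back to the smallest
    (least aggressive) ratio available.
    """
    acceptable = [ratio for ratio, err in errors.items() if err <= tolerance]
    return max(acceptable) if acceptable else min(errors)


# Made-up error values for two hypothetical images (not results from the paper).
simple_image = {8: 0.0004, 16: 0.0009, 32: 0.0014}
complex_image = {8: 0.0011, 16: 0.0032, 32: 0.0075}

if __name__ == "__main__":
    tolerance = 0.0015  # the tolerance quoted in the Figure 2 caption
    print(max_acceptable_ratio(simple_image, tolerance))
    print(max_acceptable_ratio(complex_image, tolerance))
```

With these illustrative numbers, the simple image tolerates the most aggressive ratio while the complex image only tolerates the least aggressive one, which mirrors the qualitative point of the caption.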