Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen

TL;DR

The paper interrogates whether scaling a visual tokenizer—specifically ViTok, a Vision Transformer-based auto-encoder—improves downstream image and video generation. It conducts a comprehensive study of three scaling axes: the bottleneck size E, the encoder, and the decoder, across large-scale image and video datasets, revealing that E strongly governs reconstruction while encoder scaling offers little to no benefit and decoder scaling yields mixed results for generation. Despite these findings, ViTok achieves competitive or state-of-the-art reconstruction results with substantially fewer FLOPs and, when paired with Diffusion Transformers, delivers strong or state-of-the-art generation performance on ImageNet-1K and UCF-101 video benchmarks. The work highlights that effective visual tokenization requires balancing reconstruction quality and generative ease, suggesting that future improvements should focus more on the downstream generator rather than tokenizer scaling alone. Overall, ViTok demonstrates efficient, high-quality image and video tokenization with favorable trade-offs for practical high-resolution generation tasks.

Abstract

Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. Although scaling Transformer-based generators has been central to recent advances, the tokenizer component itself is rarely scaled, leaving open questions about how auto-encoder design choices influence both its objective of reconstruction and downstream generative performance. Our work aims to conduct an exploration of scaling in auto-encoders to fill in this blank. To facilitate this exploration, we replace the typical convolutional backbone with an enhanced Vision Transformer architecture for Tokenization (ViTok). We train ViTok on large-scale image and video datasets far exceeding ImageNet-1K, removing data constraints on tokenizer scaling. We first study how scaling the auto-encoder bottleneck affects both reconstruction and generation -- and find that while it is highly correlated with reconstruction, its relationship with generation is more complex. We next explore the effect of separately scaling the auto-encoder's encoder and decoder on reconstruction and generation performance. Crucially, we find that scaling the encoder yields minimal gains for either reconstruction or generation, while scaling the decoder boosts reconstruction but the benefits for generation are mixed. Building on our exploration, we design ViTok as a lightweight auto-encoder that achieves competitive performance with state-of-the-art auto-encoders on ImageNet-1K and COCO reconstruction tasks (256p and 512p) while outperforming existing auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates competitive performance on image generation for ImageNet-1K and sets new state-of-the-art benchmarks for class-conditional video generation on UCF-101.

Paper Structure (42 sections, 3 equations, 25 figures, 7 tables)

Figures (25)

  • Figure 1: Our learnings from scaling ViTok. We showcase our ViTok architecture (left) and key findings (right) from scaling auto-encoders for image and video reconstruction and generation. We enhance traditional CNN-based auto-encoders by integrating Vision Transformers (ViTs) with an upgraded Llama architecture into an asymmetric auto-encoder framework, forming the Vision Transformer Tokenizer, or ViTok. Visual inputs are embedded as patches or tubelets, processed by a compact Llama Encoder, and bottlenecked to create a latent code. The encoded representation is then upsampled and handled by a larger Llama Decoder to reconstruct the input. Color-coded text boxes highlight the effects of scaling the encoder, adjusting the bottleneck size, and expanding the decoder. Additionally, we discuss trade-offs in loss optimization and the model's adaptability to video data. Our best performing ViTok variant achieves competitive performance with prior state-of-the-art tokenizers while reducing computational burden.
  • Figure 2: 256p image reconstruction sweep over floating points $E$. We evaluate ViTok S-B trained with stage 1 (Section \ref{sec:Experimental_Setup}) using combinations of patch sizes $p \in \{8, 16, 32\}$ and channel widths $c \in \{4, 8, 16, 32, 64\}$ to investigate how the total floating points $E = \frac{256^2}{p^2} \cdot c$ influences FID, IS, SSIM, and PSNR in reconstruction tasks. Our findings reveal a strong correlation between $\log(E)$ and $\log(\text{rFID})$, $\log(E)$ and $\text{rIS}$, $\log(E)$ and $\text{rSSIM}$, as well as $\log(E)$ and $\text{rPSNR}$, independent of the number of FLOPs utilized by the auto-encoder. This indicates that $E$ is the primary bottleneck for reconstruction, irrespective of the code shape or FLOPs expended. Additionally, similar trends are observed across the ImageNet-1K and COCO datasets, indicating that these patterns are consistent regardless of the dataset used.
  • Figure 3: 256p image reconstruction visualization over floating points $E$. Example reconstructions from ViTok S-B/16 as the number of floating points $E$ varies, obtained by setting the channel size $c \in \{64, 32, 16, 8, 4\}$ for each image across the row. As $E$ decreases, high-frequency details diminish, with small colors and fine details gradually lost. When $E < 4096$, textures merge, and significant detail loss becomes apparent.
  • Figure 4: 512p image reconstruction over $E$. We evaluate ViTok S-B trained with stage 1 (Section \ref{sec:Experimental_Setup}) across patch sizes $p \in \{8, 16, 32\}$ with a fixed channel width $c = 16$, analyzing how the total floating points, calculated as $E = \frac{512^2}{p^2} \cdot c$, influences reconstruction metrics such as FID, IS, SSIM, and PSNR. $E$ shows trends similar to the 256p results (Figure \ref{fig:256p_image_sweep}). However, achieving rPSNR/rSSIM comparable to 256p requires $4 \times E$ for 512p reconstruction, which indicates that the compression ratio of pixels to channels should be fixed to maintain performance.
  • Figure 5: 256p image generation over $E$. We evaluate each tokenizer from Figure \ref{fig:256p_image_sweep} on DiT following Section \ref{sec:Experimental_Setup}. Results for CFG scales of 1.5 and 3.0 are on the left two and right two plots respectively. Our results show no strong linear correlation between $\log(E)$ and generation performance. Instead, a second-order trend reveals an optimal $E$ for each patch size $p$, indicating a complex interplay between $E$ and $c$. This highlights the necessity of optimizing both parameters to balance reconstruction quality with generative capabilities.
  • ...and 20 more figures
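
The bottleneck size $E$ used in the captions for Figures 2 and 4 can be tabulated with a few lines of Python. This is a minimal sketch for illustration, not code from the paper: the function names `latent_floats` and `compression_ratio` are hypothetical, and the 3-channel (RGB) input in `compression_ratio` is an assumption. It shows that several (patch size, channel width) pairs share the same $E$, and that keeping the pixel-to-latent compression ratio fixed when moving from 256p to 512p multiplies $E$ by 4.

```python
from itertools import product


def latent_floats(resolution: int, patch_size: int, channels: int) -> int:
    """Total floating points E = (resolution^2 / patch_size^2) * channels."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    num_tokens = (resolution // patch_size) ** 2
    return num_tokens * channels


def compression_ratio(resolution: int, patch_size: int, channels: int,
                      in_channels: int = 3) -> float:
    """Input values per latent float; in_channels=3 (RGB) is an assumption."""
    input_values = resolution * resolution * in_channels
    return input_values / latent_floats(resolution, patch_size, channels)


# Sweep values taken from the Figure 2 caption.
PATCH_SIZES = (8, 16, 32)
CHANNEL_WIDTHS = (4, 8, 16, 32, 64)

sweep_256 = {
    (p, c): latent_floats(256, p, c)
    for p, c in product(PATCH_SIZES, CHANNEL_WIDTHS)
}

# Different code shapes can share the same E.
same_e = sorted(pc for pc, e in sweep_256.items() if e == 4096)

# Same patch size and channel width at 512p: E grows 4x, ratio is unchanged.
e_256 = latent_floats(256, 16, 16)
e_512 = latent_floats(512, 16, 16)

if __name__ == "__main__":
    for (p, c), e in sorted(sweep_256.items()):
        print(f"p={p:2d} c={c:2d} E={e}")
    print("configs with E=4096:", same_e)
    print("E at 256p:", e_256, "E at 512p:", e_512)
```

Because the captions report that reconstruction metrics track $\log(E)$ regardless of code shape, grouping configurations by `E` as above is a natural way to organize such a sweep; the generation results in Figure 5 depend on $p$ and $c$ separately, so that grouping alone would not be sufficient there.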