MacTok: Robust Continuous Tokenization for Image Generation

Hengyu Zeng, Xin Gao, Guanghao Li, Yuxiang Yan, Jiaoyang Ruan, Junpeng Ma, Haoyu Albert Wang, Jian Pu

Abstract

Continuous image tokenizers enable efficient visual generation, and those based on variational frameworks can learn smooth, structured latent representations through KL regularization. Yet this often leads to posterior collapse when using fewer tokens, where the encoder fails to encode informative features into the compressed latent space. To address this, we introduce \textbf{MacTok}, a \textbf{M}asked \textbf{A}ugmenting 1D \textbf{C}ontinuous \textbf{Tok}enizer that leverages image masking and representation alignment to prevent collapse while learning compact and robust representations. MacTok applies both random masking to regularize latent learning and DINO-guided semantic masking to emphasize informative regions in images, forcing the model to encode robust semantics from incomplete visual evidence. Combined with global and local representation alignment, MacTok preserves rich discriminative information in a highly compressed 1D latent space, requiring only 64 or 128 tokens. On ImageNet, MacTok achieves a competitive gFID of 1.44 at 256$\times$256 and a state-of-the-art 1.52 at 512$\times$512 with SiT-XL, while reducing token usage by up to 64$\times$. These results confirm that masking and semantic guidance together prevent posterior collapse and achieve efficient, high-fidelity tokenization.
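The abstract describes two masking strategies applied during tokenizer training: uniform random masking of image patches, and semantic masking that drops the most informative patches according to a saliency signal (DINO attention in the paper). The sketch below is a minimal, hedged illustration of how such a per-image mask might be sampled; the function names, the `max_ratio` parameter, and the use of a plain saliency-score list stand in for the paper's actual implementation, which is not given here.

```python
import random

def random_mask(num_patches, ratio, rng=random):
    """Boolean mask over patch positions: True = patch is masked out."""
    n_mask = int(num_patches * ratio)
    mask = [False] * num_patches
    for i in rng.sample(range(num_patches), n_mask):
        mask[i] = True
    return mask

def semantic_mask(saliency, ratio):
    """Mask the most salient patches (scores from e.g. a frozen DINO
    encoder), forcing the model to recover semantics from context."""
    num_patches = len(saliency)
    n_mask = int(num_patches * ratio)
    order = sorted(range(num_patches), key=lambda i: -saliency[i])
    mask = [False] * num_patches
    for i in order[:n_mask]:
        mask[i] = True
    return mask

def sample_mask(saliency, max_ratio=0.4, p_semantic=0.5, rng=random):
    """Per image: draw a mask ratio in [0, max_ratio], then choose random
    or semantic masking with equal probability (cf. Figure 3)."""
    ratio = rng.random() * max_ratio
    if rng.random() < p_semantic:
        return semantic_mask(saliency, ratio)
    return random_mask(len(saliency), ratio, rng)
```

Masking the *most* salient regions (rather than random ones alone) is what injects the semantic prior: the encoder cannot rely on the easiest, most informative patches and must distribute discriminative content across the compressed latent tokens.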

Paper Structure

This paper contains 32 sections, 17 equations, 34 figures, 9 tables.

Figures (34)

  • Figure 1: Effect of random masking in continuous tokenizers. Left: plain KL-VAE, latent token masking, and image token masking, with only the latter preventing posterior collapse. Right: collapsed latent space shows poor structure, while the uncollapsed one yields well-structured and diverse representations.
  • Figure 2: Generation results produced by generative models with MacTok using 64 and 128 tokens on ImageNet at 256$\times$256 and 512$\times$512.
  • Figure 3: Generation performance of MacTok with varying mask ratios sampled up to $M$, as detailed in Section 3.2. The orange star corresponds to random and semantic masking with equal probability.
  • Figure 4: Overview of the MacTok framework. Top: Transformer-based encoder and decoder operating on image, latent, and mask tokens. Bottom left: DINO-guided image masking introduces semantic priors. Bottom center: Global and local representation alignment between latent and pretrained visual representations. Bottom right: Discriminator and perceptual networks provide auxiliary supervision.
  • Figure 5: Visualization of latent space from (a) Collapsed; (b) MacTok-128 trained without representation alignment; (c) MacTok-128.
  • ...and 29 more figures