SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer

Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, Emad Barsoum

TL;DR

This work tackles the tokenization bottleneck in denoising-based image generation by introducing SoftVQ-VAE, a fully differentiable continuous tokenizer. It uses a soft categorical posterior to aggregate multiple codewords into each latent token, so that an image can be represented with as few as 32 or 64 latent tokens. By combining a ViT-based encoder–decoder with latent representations aligned to pre-trained vision features, SoftVQ-VAE delivers high-quality reconstructions and enables state-of-the-art or competitive generation across diffusion, flow, and autoregressive backbones with substantially fewer tokens. The approach yields large efficiency gains (up to 55x higher inference throughput and 2.3x fewer training iterations) while maintaining competitive FID/IS metrics, and it supports extensions such as GMMVQ-VAE and is compatible with PQ/RQ. Overall, SoftVQ-VAE offers a scalable, semantically meaningful tokenizer that improves both the efficiency and quality of large-scale generative vision models. Code and models are publicly released.

Abstract

Efficient image tokenization with high compression ratios remains a critical challenge for training generative models. We present SoftVQ-VAE, a continuous image tokenizer that leverages soft categorical posteriors to aggregate multiple codewords into each latent token, substantially increasing the representation capacity of the latent space. When applied to Transformer-based architectures, our approach compresses 256x256 and 512x512 images using as few as 32 or 64 1-dimensional tokens. SoftVQ-VAE not only delivers consistent, high-quality reconstruction but, more importantly, also achieves state-of-the-art and significantly faster image generation across different denoising-based generative models. Remarkably, SoftVQ-VAE improves inference throughput by up to 18x for generating 256x256 images and 55x for 512x512 images while achieving competitive FID scores of 1.78 and 2.21, respectively, for SiT-XL. It also improves the training efficiency of the generative models by reducing the number of training iterations by 2.3x while maintaining comparable performance. With its fully differentiable design and semantically rich latent space, our experiments demonstrate that SoftVQ-VAE achieves efficient tokenization without compromising generation quality, paving the way for more efficient generative models. Code and model are released.
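
To make the "soft categorical posterior" idea concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes the posterior is a softmax over negative squared distances between each encoder output and the codewords, scaled by a hypothetical `temperature` parameter; the codebook size (512), token dimension (16), and function names are likewise illustrative choices that do not come from the paper. It contrasts this with conventional hard vector quantization, which snaps each token to a single nearest codeword.

```python
import numpy as np


def squared_distances(z_hat, codebook):
    """Pairwise squared Euclidean distances, shape (n_tokens, n_codes)."""
    diff = z_hat[:, None, :] - codebook[None, :, :]
    return (diff ** 2).sum(axis=-1)


def soft_vq(z_hat, codebook, temperature=1.0):
    """Soft quantization: each latent token becomes a probability-weighted
    mixture of ALL codewords (assumed form: softmax over -distance / T)."""
    logits = -squared_distances(z_hat, codebook) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    z = probs @ codebook  # aggregate codewords into continuous tokens
    return z, probs


def hard_vq(z_hat, codebook):
    """Conventional VQ: each latent token is replaced by its single nearest
    codeword (argmax is not differentiable)."""
    idx = squared_distances(z_hat, codebook).argmin(axis=1)
    return codebook[idx], idx


# Illustrative sizes only: 32 latent tokens, 16 dims, 512 codewords.
rng = np.random.default_rng(0)
z_hat = rng.normal(size=(32, 16))      # stand-in for encoder outputs
codebook = rng.normal(size=(512, 16))  # stand-in for learned codewords

z_soft, probs = soft_vq(z_hat, codebook, temperature=1.0)
z_hard, idx = hard_vq(z_hat, codebook)

# As the temperature approaches zero the soft posterior collapses onto the
# nearest codeword, so soft VQ approaches hard VQ.
z_cold, _ = soft_vq(z_hat, codebook, temperature=1e-8)
```

Because every step in `soft_vq` is a smooth operation, gradients can flow from the decoder input back to both the encoder and the codebook, which is one way to read the "fully differentiable" claim above. The output tokens are continuous vectors rather than discrete indices, which is why the tokenizer is described as continuous.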

Paper Structure

This paper contains 33 sections, 13 equations, 28 figures, 10 tables.

Figures (28)

  • Figure 1: ImageNet-1K 256$\times$256 and 512$\times$512 generation results of generative models trained on SoftVQ-VAE with 32 and 64 tokens.
  • Figure 2: Illustration of SoftVQ-VAE. Left: Transformer encoder-decoder architecture with image tokens, arbitrary length of latent tokens, and mask tokens. Right top: fully-differentiable SoftVQ illustration. Right bottom: latent space representation alignment.
  • Figure 3: Linear probing accuracy of ImageNet-1K val. set on (a) latent tokens of tokenizer and (b) intermediate features (layer 20) of SiT (L for small and XL for others) trained on latents of tokenizer.
  • Figure 4: Visualization of $\hat{\mathbf{z}}$ (top), i.e., encoder output, and $\mathbf{z}$ (bottom), i.e., decoder input, of (a) VQ-S 64; (b) SoftVQ-S 64; (c) SoftVQ-S 32; (d) SoftVQ-L 32, trained with latent space alignment.
  • Figure 5: Codewords visualization
  • ...and 23 more figures