CODA: Repurposing Continuous VAEs for Discrete Tokenization

Zeyu Liu, Zanlin Ni, Yeguo Hua, Xin Deng, Xiao Ma, Cheng Zhong, Gao Huang

TL;DR

This work tackles the unstable training and poor codebook utilization of discrete tokenizers by decoupling compression from discretization. It repurposes off-the-shelf continuous VAEs for perceptual compression and adds a discretization pipeline, comprising residual quantization, attention-based sparsity, and LoRA-based adaptation, that yields a fully utilized codebook with high reconstruction fidelity. On ImageNet, CODA achieves rFID values of 0.43 and 1.34 at 8× and 16× compression, respectively, with about 6× less training budget than a standard VQGAN, and it enables competitive discrete generation when combined with MaskGIT. The approach thereby bridges the continuous and discrete generation paradigms for efficient token-based image synthesis.

Abstract

Discrete visual tokenizers transform images into a sequence of tokens, enabling token-based visual generation akin to language models. However, this process is inherently challenging, as it requires both compressing visual signals into a compact representation and discretizing them into a fixed set of codes. Traditional discrete tokenizers typically learn the two tasks jointly, often leading to unstable training, low codebook utilization, and limited reconstruction quality. In this paper, we introduce CODA (COntinuous-to-Discrete Adaptation), a framework that decouples compression and discretization. Instead of training discrete tokenizers from scratch, CODA adapts off-the-shelf continuous VAEs, which are already optimized for perceptual compression, into discrete tokenizers via a carefully designed discretization process. By primarily focusing on discretization, CODA ensures stable and efficient training while retaining the strong visual fidelity of continuous VAEs. Empirically, with $6\times$ less training budget than standard VQGAN, our approach achieves a remarkable codebook utilization of 100% and notable reconstruction FID (rFID) of $0.43$ and $1.34$ for $8\times$ and $16\times$ compression on the ImageNet $256 \times 256$ benchmark.

Paper Structure

This paper contains 26 sections, 11 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: (a) Conventional discrete VQ tokenizers learn to compress and discretize inherently continuous visual signals into codes simultaneously. This leads to multiple challenges in training, and the resulting unsatisfactory latent space becomes a bottleneck that limits the performance of discrete token-based generation models. (b) Our proposed CODA tokenizer leverages continuous VAEs for compression and directly discretizes their latent space. (c) Quantitative comparisons between VQGAN (Esser et al., 2021) and our proposed CODA tokenizer.
  • Figure 2: Illustration of our CODA tokenizer. (a) A residual quantization process of $L$ levels iteratively refines the approximation of a continuous VAE vector $f$ through a composition of multiple quantization layers, progressively reducing the quantization error. Because the continuous VAE vector is approximated by a combination of $L$ discrete codes, the representational capacity is also significantly enlarged. (b) The attention-based quantization process frames discretization as a retrieval task. Continuous features and codebook embeddings are projected and normalized onto a hypersphere, where the softmax attention matrix is computed to determine the confidence of code selection. Because codes compete within the softmax attention framework, this approach ensures a sparse and unambiguous assignment.
  • Figure 3: Visualization of latent space approximation: (a) the original latent space of the continuous VAE, (b) latent space approximated by vector quantization and (c) latent space approximated by residual quantization.
  • Figure 4: Effect of residual quantization levels on tokenizer performance. As the number of residual quantization levels increases, the quantization error consistently decreases and the reconstruction performance (measured by rFID) steadily improves.
  • Figure 5: Visualization of top assignment confidence scores for 16 randomly selected continuous VAE features. For vector quantization, we visualize the distance of codes to the continuous feature, with lower distance representing higher confidence.
  • ...and 2 more figures
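
The two mechanisms described in the Figure 2 caption can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, codebook sizes, the temperature value, and the random codebooks are all hypothetical, the learned projections mentioned in the caption are omitted, and plain nearest-neighbor lookup stands in for the code selection inside the residual loop. A zero vector is added to each toy codebook purely so that the per-level error cannot increase in this demo.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit L2 norm, i.e. place it on the unit hypersphere."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_assign(features, codebook, temperature=0.05):
    """Pick one code per feature via softmax attention over cosine similarity.

    Assumption: the learned projections from the caption are left out, and
    `temperature` is a hypothetical hyperparameter.
    Returns (indices, confidence of the selected code).
    """
    q = l2_normalize(features)                      # (N, D) queries
    k = l2_normalize(codebook)                      # (K, D) keys
    attn = softmax(q @ k.T / temperature, axis=-1)  # (N, K) attention matrix
    idx = attn.argmax(axis=-1)
    return idx, attn[np.arange(len(idx)), idx]


def residual_quantize(f, codebooks):
    """Approximate each row of f by a sum of one code per level.

    Each level quantizes what the previous levels failed to capture.
    Returns (approximation, per-level indices, mean squared error per level).
    """
    residual = f.copy()
    approx = np.zeros_like(f)
    indices, errors = [], []
    for codebook in codebooks:
        # Squared Euclidean distance from every residual to every code.
        d = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)
        chosen = codebook[idx]
        approx += chosen
        residual -= chosen
        indices.append(idx)
        errors.append(float((residual ** 2).sum(-1).mean()))
    return approx, np.stack(indices, axis=1), errors


# Toy data standing in for continuous VAE vectors (hypothetical sizes).
rng = np.random.default_rng(0)
f = rng.normal(size=(16, 4))

# Four levels of 32 codes each; code 0 is the zero vector (demo device only).
codebooks = []
for level in range(4):
    codes = rng.normal(size=(31, 4)) * 0.5 ** level
    codebooks.append(np.vstack([np.zeros((1, 4)), codes]))

approx, indices, errors = residual_quantize(f, codebooks)
idx, conf = attention_assign(f, codebooks[0][1:])
print("mean squared error after each level:", errors)
print("attention confidence of selected codes:", conf.round(3))
```

In this sketch each vector ends up represented by $L = 4$ indices, one per level, which is what enlarges the representational capacity relative to a single codebook lookup. A lower temperature pushes the softmax toward a one-hot distribution, which is one simple way to picture the sparse assignment the caption describes.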