3D representation in 512-Byte:Variational tokenizer is the key for autoregressive 3D generation

Jinzhi Zhang, Feng Xiong, Mu Xu

TL;DR

The paper tackles autoregressive 3D generation by addressing the absence of an effective tokenizer for unordered 3D data. It introduces the Variational Tokenizer (VAT), which uses an in-context transformer and a Variational Vector Quantizer (VVQ) to map 3D features into a Gaussian latent space with hierarchical, cross-scale tokens, followed by a triplane decoder. A second-stage next-scale autoregressive model then generates high-fidelity 3D shapes conditioned on image and text prompts, achieving substantial compression (up to $2000\times$) and superior quality and generalization on Objaverse. The combination of VAT, VVQ, and a triplane-based decoder enables scalable, efficient 3D generation with strong multi-condition support and outperforms state-of-the-art methods on several metrics, highlighting its practical impact for high-fidelity, compact 3D synthesis.

Abstract

Autoregressive transformers have revolutionized high-fidelity image generation. One crucial ingredient lies in the tokenizer, which compresses high-resolution image patches into manageable discrete tokens with a scanning or hierarchical order suitable for large language models. Extending these tokenizers to 3D generation, however, presents a significant challenge: unlike image patches, which naturally exhibit spatial sequence and multi-scale relationships, 3D data lacks an inherent order, making it difficult to compress into fewer tokens while preserving structural details. To address this, we introduce the Variational Tokenizer (VAT), which transforms unordered 3D data into compact latent tokens with an implicit hierarchy, suited for efficient and high-fidelity coarse-to-fine autoregressive modeling. VAT begins with an in-context transformer, which compresses numerous unordered 3D features into a reduced token set with minimal information loss. This latent space is then mapped to a Gaussian distribution for residual quantization, with token counts progressively increasing across scales. In this way, tokens at different scales naturally establish interconnections by allocating themselves to different subspaces within the same Gaussian distribution, facilitating discrete modeling of token relationships across scales. During the decoding phase, a high-resolution triplane is used to convert these compact latent tokens into detailed 3D shapes. Extensive experiments demonstrate that VAT enables scalable and efficient 3D generation, outperforming existing methods in quality, efficiency, and generalization. Remarkably, VAT achieves up to 250x compression, reducing a 1MB mesh to just 3.9KB with a 96% F-score, and can further compress to 256 int8 tokens, achieving a 2000x reduction while maintaining a 92% F-score.
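The compression figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below assumes that "1MB" means 10^6 bytes and that "KB" means 1024 bytes; neither convention is stated in the abstract, and the helper name `compressed_size` is purely illustrative.

```python
# Assumption: "1MB" is read as 10**6 bytes; the abstract does not say which convention it uses.
MESH_BYTES = 1_000_000


def compressed_size(original_bytes, ratio):
    """Size in bytes after compressing by the given ratio (hypothetical helper)."""
    return original_bytes / ratio


size_250 = compressed_size(MESH_BYTES, 250)    # the 250x setting
size_2000 = compressed_size(MESH_BYTES, 2000)  # the 2000x setting

print(size_250 / 1024)  # 3.90625, i.e. about 3.9 KB when 1 KB = 1024 bytes
print(size_2000)        # 500.0 bytes
```

Under these assumptions the 250x setting reproduces the 3.9KB figure, and the 2000x setting implies roughly 500 bytes, which is close to the 512 bytes named in the title. How the 256 tokens map onto that byte count is not spelled out in this excerpt.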

Paper Structure

This paper contains 24 sections, 5 equations, 21 figures, 5 tables, 1 algorithm.

Figures (21)

  • Figure 1: We propose the Variational Tokenizer (VAT), which compresses unordered 3D data into compact 1D latent tokens with up to $2000\times$ compression, while supporting efficient and high-fidelity 3D generation via autoregressive modeling. (a) 3D shape compression results. (Top row: original high-resolution 3D models; middle and bottom rows: reconstructed meshes with 1024 and 256 tokens.) (b) 3D generation results using next-scale autoregressive modeling [DBLP:journals/corr/abs-2409-06322] conditioned on images (left) and text (right). Each row shows different generated shapes based on the specified input condition, with the arrows indicating the emphasis on either image-based or text-based generation, controlled via Classifier-Free Guidance (CFG) [VAR] to prioritize each condition.
  • Figure 2: Comparison between (a) conventional tokenizer and (b) our proposed Variational Tokenizer (VAT). In (a), an encoder transforms input features into latent embeddings $Z$, which are directly quantized into discrete tokens. In (b), VAT employs an in-context transformer to compress unordered input features into a reduced token set, which is then mapped to a Gaussian distribution. Quantization is residually applied across scales, allowing tokens to self-organize into distinct subspaces within the same Gaussian distribution, enabling autoregressive next-scale token prediction.
  • Figure 3: Overview of the two-stage training pipeline. (a) Stage 1: Training the Variational Tokenizer (VAT). The process begins with a 3D point cloud that is transformed into point features and compressed into latent tokens using a transformer encoder (Sec. \ref{ch:vq}). Variational Vector Quantization (VVQ) maps these latent tokens onto cross-scale discrete tokens. These discrete tokens are decoded into a triplane representation, which is subsequently upsampled and processed by an MLP to generate the dense occupancy volume. (b) Stage 2: Training the Next-Scale Autoregressive Transformer on discrete tokens. Here, the discrete tokens generated by VAT serve as the supervision signal for a decoder-only transformer trained for next-scale prediction. The model is conditioned on image and text features, uses a causal attention mask, and is trained with a cross-entropy loss (Sec. \ref{ch: AR_modeling}).
  • Figure 4: Comparison of state-of-the-art 3D generation methods using in-the-wild images. Note that the commercial software displayed on the left may draw on large amounts of its own proprietary training data, whereas our model is trained only on the Objaverse dataset.
  • Figure 5: VAT enables robust and generalizable 3D generation conditioned on in-the-wild images.
  • ...and 16 more figures
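
The coarse-to-fine residual quantization described in the Figure 2 caption can be illustrated with a toy sketch. This is not the paper's VVQ. It is a generic multi-scale residual quantizer written under several assumptions: the latent tokens are stand-in samples from a standard Gaussian, one random codebook is shared by every scale, coarser scales are formed by average-pooling neighbouring tokens, and the scale schedule `[1, 2, 4, 8]` is arbitrary. All function names are hypothetical.

```python
import random


def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def pool(tokens, n):
    """Average-pool a list of N vectors down to n vectors (N must be divisible by n)."""
    group = len(tokens) // n
    dim = len(tokens[0])
    return [
        [sum(tokens[i * group + j][d] for j in range(group)) / group for d in range(dim)]
        for i in range(n)
    ]


def multiscale_residual_quantize(z, codebook, scales):
    """Quantize the residual at each scale, from few tokens to many."""
    n_tokens, dim = len(z), len(z[0])
    residual = [list(v) for v in z]              # what is still unexplained
    recon = [[0.0] * dim for _ in range(n_tokens)]
    ids_per_scale = []
    for n in scales:
        # One discrete id per pooled token at this scale.
        ids = [nearest_code(v, codebook) for v in pool(residual, n)]
        group = n_tokens // n
        for i in range(n_tokens):
            code = codebook[ids[i // group]]     # broadcast the code back to full length
            for d, c in enumerate(code):
                recon[i][d] += c
                residual[i][d] -= c
        ids_per_scale.append(ids)
    return ids_per_scale, recon


random.seed(0)
dim, n_tokens = 4, 8
# Assumption: a random codebook whose first entry is the zero vector.
codebook = [[0.0] * dim] + [[random.gauss(0, 1) for _ in range(dim)] for _ in range(63)]
# Assumption: Gaussian samples stand in for the encoder's latent tokens.
z = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
scales = [1, 2, 4, 8]

ids_per_scale, recon = multiscale_residual_quantize(z, codebook, scales)
print([len(ids) for ids in ids_per_scale])  # [1, 2, 4, 8]
```

Each scale only has to encode what the coarser scales left unexplained, so a next-scale autoregressive model can predict the id lists in order, from the single coarsest token to the full-length sequence. Including the zero vector in the codebook guarantees that adding a scale never increases the squared reconstruction error in this toy setting.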