3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation
Jinzhi Zhang, Feng Xiong, Mu Xu
TL;DR
The paper tackles the challenge of autoregressive 3D generation by addressing the absence of an effective tokenizer for unordered 3D data. It introduces the Variational Tokenizer (VAT), which uses an in-context transformer and a Variational Vector Quantizer to map 3D features into a Gaussian latent space with hierarchical, cross-scale tokens, followed by a triplane decoder. A second-stage next-scale autoregressive model then generates high-fidelity 3D shapes conditioned on image and text prompts, achieving substantial compression (up to $2000\times$) and superior quality and generalization on Objaverse. The combination of VAT, VVQ, and a Triplane-based decoder enables scalable, efficient 3D generation with strong multi-condition support and outperforms state-of-the-art methods in several metrics, highlighting practical impact for high-fidelity, compact 3D synthesis.
Abstract
Autoregressive transformers have revolutionized high-fidelity image generation. One crucial ingredient lies in the tokenizer, which compresses high-resolution image patches into manageable discrete tokens with a scanning or hierarchical order suitable for large language models. Extending these tokenizers to 3D generation, however, presents a significant challenge: unlike image patches that naturally exhibit spatial sequence and multi-scale relationships, 3D data lacks an inherent order, making it difficult to compress into fewer tokens while preserving structural details. To address this, we introduce the Variational Tokenizer (VAT), which transforms unordered 3D data into compact latent tokens with an implicit hierarchy, suited for efficient and high-fidelity coarse-to-fine autoregressive modeling. VAT begins with an in-context transformer, which compresses numerous unordered 3D features into a reduced token set with minimal information loss. This latent space is then mapped to a Gaussian distribution for residual quantization, with token counts progressively increasing across scales. In this way, tokens at different scales naturally establish interconnections by allocating themselves into different subspaces within the same Gaussian distribution, facilitating discrete modeling of token relationships across scales. During the decoding phase, a high-resolution triplane is utilized to convert these compact latent tokens into detailed 3D shapes. Extensive experiments demonstrate that VAT enables scalable and efficient 3D generation, outperforming existing methods in quality, efficiency, and generalization. Remarkably, VAT achieves up to 250x compression, reducing a 1MB mesh to just 3.9KB with a 96% F-score, and can further compress to 256 int8 tokens, achieving a 2000x reduction while maintaining a 92% F-score.
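The compression figures quoted above can be sanity-checked with simple arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes "1MB" means 1 MiB and "3.9KB" means 3.9 KiB, and it uses the 512-byte budget from the title for the most compact setting. How the 256 int8 tokens map onto that byte budget is not stated here, so the sketch works from byte sizes alone. The helper name `compression_ratio` is hypothetical.

```python
# Assumption: binary units (1 MiB = 1024 * 1024 bytes, 1 KiB = 1024 bytes).
KIB = 1024
MIB = 1024 * KIB


def compression_ratio(original_bytes: float, compressed_bytes: float) -> float:
    """Return how many times smaller the compressed form is than the original."""
    return original_bytes / compressed_bytes


mesh_bytes = 1 * MIB  # the "1MB mesh" from the abstract

# Larger setting: 3.9 KB latent, reported as "up to 250x".
ratio_3p9kb = compression_ratio(mesh_bytes, 3.9 * KIB)

# Most compact setting: the 512-byte budget named in the title,
# reported as a "2000x reduction".
ratio_512b = compression_ratio(mesh_bytes, 512)

print(f"3.9 KB latent: about {ratio_3p9kb:.0f}x")
print(f"512 B latent:  about {ratio_512b:.0f}x")
```

Under these assumptions both ratios land close to the rounded figures in the abstract. Using decimal units instead (1 MB = 1,000,000 bytes) shifts the results by a few percent but does not change the order of magnitude.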
