G3PT: Unleash the power of Autoregressive Modeling in 3D Generation via Cross-scale Querying Transformer

Jinzhi Zhang, Feng Xiong, Mu Xu

TL;DR

G3PT reframes 3D generation as a cross-scale autoregressive task by mapping unordered point-based data into discrete tokens across multiple levels of detail. The core innovations are the Cross-scale Querying Transformer (CQT), including Cross-scale Vector Quantization (CVQ) for tokenization and Cross-scale AutoRegressive Modeling (CAR) for next-scale prediction, which enable global, order-agnostic token interactions and coarse-to-fine generation. The model demonstrates state-of-the-art results on 3D content creation, with strong conditioning capabilities from image and text inputs and surprising power-law scaling behavior as model size increases. This work provides a scalable, flexible framework for 3D autoregressive generation that avoids artificial ordering and leverages cross-scale interactions to capture complex geometry and semantics, with practical implications for texture synthesis and multi-modal 3D content creation.

Abstract

Autoregressive transformers have revolutionized generative models in language processing and shown substantial promise in image and video generation. However, these models face significant challenges when extended to 3D generation tasks due to their reliance on next-token prediction to learn token sequences, which is incompatible with the unordered nature of 3D data. Instead of imposing an artificial order on 3D data, in this paper, we introduce G3PT, a scalable coarse-to-fine 3D generative model utilizing a cross-scale querying transformer. The key is to map point-based 3D data into discrete tokens with different levels of detail, naturally establishing a sequential relationship between different levels suitable for autoregressive modeling. Additionally, the cross-scale querying transformer connects tokens globally across different levels of detail without requiring an ordered sequence. Benefiting from this approach, G3PT features a versatile 3D generation pipeline that effortlessly supports diverse conditional structures, enabling the generation of 3D shapes from various types of conditions. Extensive experiments demonstrate that G3PT achieves superior generation quality and generalization ability compared to previous 3D generation methods. Most importantly, for the first time in 3D generation, scaling up G3PT reveals distinct power-law scaling behaviors.
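The abstract reports power-law scaling but does not state its functional form. As a general reminder of what the term means, and not as the paper's fitted law, a power law between a test loss $L$ and a model size $N$ can be written as below. The symbols $L$, $N$, $a$, and $b$ are assumptions introduced here for illustration.

```latex
$$
L(N) = a \, N^{-b}, \qquad a > 0,\; b > 0
\quad\Longleftrightarrow\quad
\log L(N) = \log a - b \log N
$$
```

Under this form, loss plotted against model size on log-log axes falls on a straight line with slope $-b$, which is the usual visual test for power-law behavior.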

Paper Structure

This paper contains 19 sections, 7 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Overall pipeline for processing and generating unordered 3D data. (a) G3PT starts by encoding the input point cloud into discrete scales of token maps, each representing different levels of detail. The proposed Cross-scale Querying Transformer (CQT) utilizes a cross-attention layer with varying numbers of queries to globally connect tokens across different scales, without requiring the tokens to be in a specific order. The final output is the SDF value for each query point. (b) CQT enables 3D generation from coarse to fine scales under various conditions. An autoregressive transformer is trained using next-scale prediction.
  • Figure 2: (a) The previous quantization method in VAR relies on average pooling and bilinear upsampling, which are not suitable for unordered data. (b) Our Cross-scale Vector Quantization (CVQ) overcomes this limitation by using a set of cross-scale learnable latent queries to globally "pool" and "upsample" unordered tokens. During the quantization stages, these learnable queries "downsample" features into fewer tokens at each scale, forming level-of-detail representations. These tokens are then "upsampled" to their original scale using another cross-attention layer.
  • Figure 3: In next-scale prediction (VAR) in G3PT, the transformer predicts the next-scale token map using features derived from the "upsampled" tokens of the previous scale. The "upsampling" process involves two layers of cross-attention to align the number of tokens across scales. First, features are "upsampled" with a learnable query $\tilde{e}_s$, and then "downsampled" using "downsampling" queries $e_s$ to match the token number of the next scale. A causal mask is applied to maintain the correct order and dependencies across different scales and input conditions, ensuring coherence in the model's predictions.
  • Figure 4: Qualitative comparisons with state-of-the-art methods on the Objaverse dataset (deitke2023objaverse).
  • Figure 5: Mesh visualization using SyncMVD (liu2023text) to generate textures for the meshes produced by G3PT.
  • ...and 1 more figure
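The captions for Figures 2 and 3 describe two mechanisms that a small sketch can make concrete: cross-attention with a set of queries to "pool" and "upsample" unordered tokens, and a causal mask over scales. The NumPy code below is a minimal illustration under stated assumptions, not the authors' implementation. It uses single-head attention with no learned projections, random arrays in place of learnable queries, and no codebook quantization. The names `cross_attend` and `scale_causal_mask` are hypothetical, and the block-wise mask (each token sees its own scale and all coarser ones) is assumed in the style of VAR rather than taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, tokens):
    """Single-head cross-attention without projections (simplifying assumption).

    Each query returns a softmax-weighted average of all tokens, so the
    number of outputs equals the number of queries, not the number of tokens.
    """
    dim = queries.shape[-1]
    weights = softmax(queries @ tokens.T / np.sqrt(dim))  # (n_queries, n_tokens)
    return weights @ tokens, weights


def scale_causal_mask(tokens_per_scale):
    """Boolean mask, True where attention is allowed.

    Assumed block-wise rule: a token may attend to every token of its own
    scale and of all coarser (earlier) scales, never to finer ones.
    """
    scale_id = np.repeat(np.arange(len(tokens_per_scale)), tokens_per_scale)
    return scale_id[:, None] >= scale_id[None, :]


rng = np.random.default_rng(0)
dim, num_tokens, num_coarse = 8, 16, 4

features = rng.normal(size=(num_tokens, dim))       # unordered token features
down_queries = rng.normal(size=(num_coarse, dim))   # stand-in for "downsampling" queries
up_queries = rng.normal(size=(num_tokens, dim))     # stand-in for "upsampling" queries

# "Pool" 16 tokens into 4, then "upsample" back to 16 via a second cross-attention.
coarse, w_down = cross_attend(down_queries, features)
restored, _ = cross_attend(up_queries, coarse)

# Shuffling the input tokens leaves the pooled result unchanged (order-agnostic).
perm = rng.permutation(num_tokens)
coarse_perm, _ = cross_attend(down_queries, features[perm])

# Mask for three scales holding 1, 2, and 4 tokens respectively.
mask = scale_causal_mask([1, 2, 4])

print(coarse.shape, restored.shape, np.allclose(coarse, coarse_perm))
print(mask.astype(int))
```

Because the attention weights are normalized over the token axis and then summed, the pooled output does not depend on token order, which is the property that average pooling and bilinear upsampling on a grid cannot offer for point-based data.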