OctGPT: Octree-based Multiscale Autoregressive Models for 3D Shape Generation

Si-Tong Wei, Rui-Huan Wang, Chuan-Zhi Zhou, Baoquan Chen, Peng-Shuai Wang

TL;DR

OctGPT is a novel multiscale autoregressive model for 3D shape generation that dramatically improves the efficiency and performance of prior 3D autoregressive approaches, while rivaling or surpassing state-of-the-art diffusion models.

Abstract

Autoregressive models have achieved remarkable success across various domains, yet their performance in 3D shape generation lags significantly behind that of diffusion models. In this paper, we introduce OctGPT, a novel multiscale autoregressive model for 3D shape generation that dramatically improves the efficiency and performance of prior 3D autoregressive approaches, while rivaling or surpassing state-of-the-art diffusion models. Our method employs a serialized octree representation to efficiently capture the hierarchical and spatial structures of 3D shapes. Coarse geometry is encoded via octree structures, while fine-grained details are represented by binary tokens generated using a vector quantized variational autoencoder (VQVAE), transforming 3D shapes into compact multiscale binary sequences suitable for autoregressive prediction. To address the computational challenges of handling long sequences, we incorporate octree-based transformers enhanced with 3D rotary positional encodings, scale-specific embeddings, and token-parallel generation schemes. These innovations reduce training time 13-fold and generation time 69-fold, enabling efficient training on high-resolution 3D shapes, e.g., $1024^3$, on just four NVIDIA 4090 GPUs within days. OctGPT showcases exceptional versatility across various tasks, including text-, sketch-, and image-conditioned generation, as well as scene-level synthesis involving multiple objects. Extensive experiments demonstrate that OctGPT accelerates convergence and improves generation quality over prior autoregressive methods, offering a new paradigm for high-quality, scalable 3D content creation. Our code and trained models are available at https://github.com/octree-nn/octgpt.
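To make the "multiscale binary sequence" idea concrete, here is a minimal sketch of how coarse geometry could be turned into per-depth binary splitting signals. It is an illustration under stated assumptions, not the authors' implementation: the function name `split_signals`, the unit-cube normalization, and the child ordering (x as the highest bit) are all hypothetical choices, and the VQVAE latent tokens for fine detail are left out.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def split_signals(points: Sequence[Point], max_depth: int) -> List[List[int]]:
    """Return one list of binary tokens per octree depth (hypothetical helper).

    A token is 1 if the child cell contains at least one point (so it would
    be subdivided further) and 0 if it is empty. Points are assumed to lie
    in the unit cube [0, 1)^3.
    """
    # Each node is (origin, edge length, points inside); start from the root.
    level = [((0.0, 0.0, 0.0), 1.0, list(points))]
    signals: List[List[int]] = []
    for _ in range(max_depth):
        bits: List[int] = []
        next_level = []
        for (ox, oy, oz), size, pts in level:
            half = size / 2
            buckets: List[List[Point]] = [[] for _ in range(8)]
            for p in pts:
                # Child index: one bit per axis (assumed order: x, y, z).
                i = (
                    (int(p[0] >= ox + half) << 2)
                    | (int(p[1] >= oy + half) << 1)
                    | int(p[2] >= oz + half)
                )
                buckets[i].append(p)
            for i, bucket in enumerate(buckets):
                bits.append(1 if bucket else 0)
                if bucket:
                    origin = (
                        ox + half * ((i >> 2) & 1),
                        oy + half * ((i >> 1) & 1),
                        oz + half * (i & 1),
                    )
                    next_level.append((origin, half, bucket))
        signals.append(bits)
        level = next_level  # only occupied cells are refined
    return signals


if __name__ == "__main__":
    pts = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9)]
    for depth, bits in enumerate(split_signals(pts, 2), start=1):
        print(depth, bits)
```

Because only occupied cells are refined, the token count at each depth grows with the surface of the shape rather than with the full voxel grid, which is what keeps the serialized sequence compact. Concatenating the per-depth lists from coarse to fine gives a single sequence that an autoregressive model could predict scale by scale.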

Paper Structure

This paper contains 36 sections, 2 equations, 17 figures, 5 tables.

Figures (17)

  • Figure 1: Overview. 3D shapes are encoded as multiscale serialized octrees, where coarse structures are represented by multiscale binary splitting signals derived from the octree hierarchy, and fine-grained details are captured by binarized latent codes from an octree-based VQVAE. These binary tokens, along with teacher-forcing masks, are fed into a transformer for autoregressive training. During inference, the transformer progressively predicts the token sequence to reconstruct the octree and latent codes, generating 3D shapes from coarse to fine. The sequence is decoded by the VQVAE to produce the final 3D shape.
  • Figure 2: Octree and z-order curves. 2D images are used for clearer illustration. (a): The input point cloud with its corresponding octree. Node statuses are color-coded: darker colors represent nodes containing points, lighter colors indicate empty nodes, and gray denotes non-existing nodes at the given depth. (b) & (c): z-order curves at octree depths 2 and 3, respectively.
  • Figure 3: The architecture of the octree-based VQVAE. The encoder compresses the input octree signals with octree-based residual blocks and reduces the depth of the octree by 2. The features are then quantized into binary tokens and fed into the decoder. The decoder builds a dual octree graph and applies graph convolution to predict SDFs for shape reconstruction.
  • Figure 4: Multiscale autoregressive models. (a) Our model predicts multiple tokens autoregressively according to the depth-wise teacher-forcing mask. Tokens at different scales are represented in distinct colors, while masks are depicted in gray. (b) Octree-based window attention is adopted for cross-scale communication and improved computational efficiency. (c) Shifted window attention allows for interactions across different windows.
  • Figure 5: Comparison with state-of-the-art 3D autoregressive models. Experiments are conducted on the chair category. Top: the generated shapes. Bottom: the corresponding token.
  • ...and 12 more figures
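
The z-order curves mentioned in the Figure 2 caption can be reproduced in a few lines. The sketch below follows the caption in using 2D for clarity and lists the cells of a $2^d \times 2^d$ grid in z-order by interleaving coordinate bits. The function names and the choice of x as the low bit are assumptions made for illustration; the paper's code may use a different axis convention.

```python
from typing import List, Tuple


def z_order_key(x: int, y: int, depth: int) -> int:
    """Interleave the low `depth` bits of x and y (x assumed to be the low bit)."""
    key = 0
    for b in range(depth):
        key |= ((x >> b) & 1) << (2 * b)
        key |= ((y >> b) & 1) << (2 * b + 1)
    return key


def z_order_cells(depth: int) -> List[Tuple[int, int]]:
    """Return all (x, y) cells of a 2^depth x 2^depth grid in z-order."""
    n = 1 << depth
    cells = [(x, y) for y in range(n) for x in range(n)]
    return sorted(cells, key=lambda c: z_order_key(c[0], c[1], depth))


if __name__ == "__main__":
    print(z_order_cells(1))      # the four cells of a 2 x 2 grid
    print(z_order_cells(2)[:8])  # first two quadrant-sized blocks at depth 2
```

A useful property of this ordering is that the four children of any cell are contiguous in the sequence, so a serialized octree keeps spatially nearby nodes close together. The same bit-interleaving extends to 3D by adding a third coordinate and using a stride of 3 bits per level.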