Pushing the Limits of 3D Shape Generation at Scale

Yu Wang, Xuelin Qian, Jingyang Huo, Tiejun Huang, Bo Zhao, Yanwei Fu

TL;DR

Argus-3D scales 3D shape generation to 3.6B parameters by combining tri-plane latent representations with a discrete codebook and a multimodal Transformer. It employs a two-stage pipeline: learn discrete latent codes from tri-plane features and then autoregressively generate codebook indices conditioned on multimodal inputs, enabling unconditional, class-guided, image-guided, and text-guided generation. Trained on ~900k shapes from diverse sources, it achieves state-of-the-art quality and diversity across multiple generation tasks, illustrating the practicality of large-scale 3D generative models for gaming, VR, and product design. The work also discusses data and computation demands and outlines future directions for more efficient architectures and novel 3D representations to broaden applicability and reduce resource requirements.
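The first stage described above turns continuous latent vectors into codebook indices. A minimal sketch of that lookup follows, assuming a nearest-neighbour rule under squared Euclidean distance (the usual choice in vector-quantized models, not confirmed here as the paper's exact rule). The function names `quantize` and `dequantize`, the 2-dimensional latents, and the 4-entry codebook are all hypothetical and chosen only for illustration.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    Assumption: nearest neighbour under squared Euclidean distance.
    """
    indices = []
    for z in latents:
        # Squared distance from this latent to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(z, e)) for e in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def dequantize(indices, codebook):
    """Recover the quantized vectors from their codebook indices."""
    return [codebook[i] for i in indices]


# Toy data (hypothetical): a 4-entry codebook of 2-D vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

indices = quantize(latents, codebook)
print(indices)                        # [0, 3, 2]
print(dequantize(indices, codebook))  # [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
```

The resulting index sequence is what a second-stage autoregressive model would be trained to predict, one index at a time.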

Abstract

We present a significant breakthrough in 3D shape generation by scaling it to unprecedented dimensions. Through the adaptation of an auto-regressive model and the utilization of large language models, we have developed a remarkable model with an astounding 3.6 billion trainable parameters, establishing it as the largest 3D shape generation model to date, named Argus-3D. Our approach addresses the limitations of existing methods by enhancing the quality and diversity of generated 3D shapes. To tackle the challenges of high-resolution 3D shape generation, our model incorporates tri-plane features as latent representations, effectively reducing computational complexity. Additionally, we introduce a discrete codebook for efficient quantization of these representations. Leveraging the power of transformers, we enable multi-modal conditional generation, facilitating the production of diverse and visually impressive 3D shapes. To train our expansive model, we leverage an ensemble of publicly available 3D datasets, consisting of a comprehensive collection of approximately 900,000 objects from renowned repositories such as ModelNet40, ShapeNet, Pix3D, 3D-Future, and Objaverse. This diverse dataset empowers our model to learn from a wide range of object variations, bolstering its ability to generate high-quality and diverse 3D shapes. Extensive experiments demonstrate the remarkable efficacy of our approach in significantly improving the visual quality of generated 3D shapes. By pushing the boundaries of 3D generation, introducing novel methods for latent representation learning, and harnessing the power of transformers for multi-modal conditional generation, our contributions pave the way for substantial advancements in the field. Our work unlocks new possibilities for applications in gaming, virtual reality, product design, and other domains that demand high-quality and diverse 3D objects.
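The claim that tri-plane features reduce computational complexity can be illustrated with a simple cell count: a dense grid at resolution R has R³ cells, while three R×R planes have 3R² cells. The sketch below only counts cells. The resolutions are illustrative assumptions, since the resolution and channel count actually used by Argus-3D are not stated in this section.

```python
def dense_grid_cells(resolution):
    """Number of cells in a dense resolution^3 voxel grid."""
    return resolution ** 3


def triplane_cells(resolution):
    """Number of cells in three resolution x resolution feature planes."""
    return 3 * resolution ** 2


# Illustrative resolutions only (assumed, not taken from the paper).
for r in (32, 64, 128, 256):
    dense = dense_grid_cells(r)
    tri = triplane_cells(r)
    print(f"R={r:4d}  dense={dense:>10d}  tri-plane={tri:>7d}  ratio={dense / tri:.1f}x")
```

The ratio is R/3, so the saving grows linearly with resolution.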

Paper Structure

This paper contains 34 sections, 7 equations, 11 figures, and 3 tables.

Figures (11)

  • Figure 1: Overview of our model. We derive shape features from an arbitrary 3D shape by encoding point features into three normalized orthogonal planes and projecting them into a latent vector. Next, a learnable codebook quantizes it into a discrete representation. These discrete representations help the transformer in the second stage learn the joint distribution over the large number of shape features encoded in the codebook. Furthermore, by concatenating various conditions such as images or text, the transformer is capable of generating discrete representations for 3D shapes. We develop a remarkable 3D shape generation model with 3.6 billion trainable parameters.
  • Figure 2: Qualitative results of unconditional generation.
  • Figure 3: Qualitative results of class-guided generation.
  • Figure 4: Visualizations of image-guided shape generation.
  • Figure 5: Results of real-world image-guided generation.
  • ...and 6 more figures
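
The Figure 1 caption describes encoding point features into three orthogonal planes. A rough sketch of that projection step is given below, under several assumptions that are not from the paper: points are already normalized to the unit cube, each point carries a single scalar feature (the real model would use learned feature vectors), and features falling in the same cell are averaged. The function name `project_to_triplanes` is hypothetical.

```python
def project_to_triplanes(points, features, resolution):
    """Average per-point scalar features into xy, xz, and yz grids.

    Assumptions: points are (x, y, z) tuples in [0, 1]; features holds one
    float per point; cells that receive no point are left at 0.0.
    """
    axes = {"xy": (0, 1), "xz": (0, 2), "yz": (1, 2)}
    planes = {}
    for name, (a, b) in axes.items():
        sums = [[0.0] * resolution for _ in range(resolution)]
        counts = [[0] * resolution for _ in range(resolution)]
        for p, f in zip(points, features):
            # Drop one coordinate and bin the remaining two into grid cells.
            i = min(int(p[a] * resolution), resolution - 1)
            j = min(int(p[b] * resolution), resolution - 1)
            sums[i][j] += f
            counts[i][j] += 1
        planes[name] = [
            [s / c if c else 0.0 for s, c in zip(sum_row, count_row)]
            for sum_row, count_row in zip(sums, counts)
        ]
    return planes


# Toy input (hypothetical): three points with scalar features.
points = [(0.1, 0.1, 0.9), (0.1, 0.1, 0.1), (0.9, 0.9, 0.9)]
features = [1.0, 3.0, 5.0]
planes = project_to_triplanes(points, features, resolution=2)
print(planes["xy"])  # [[2.0, 0.0], [0.0, 5.0]]
print(planes["xz"])  # [[3.0, 1.0], [0.0, 5.0]]
```

The first two points share an xy cell, so their features are averaged there, but they stay separate on the xz and yz planes. This is why three planes together retain more spatial information than any single projection.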