Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability

Xuelin Qian, Yu Wang, Simian Luo, Yinda Zhang, Ying Tai, Zhenyu Zhang, Chengjie Wang, Xiangyang Xue, Bo Zhao, Tiejun Huang, Yunsheng Wu, Yanwei Fu

TL;DR

This paper extends auto-regressive models to 3D domains, and introduces discrete representation learning based on a latent vector instead of volumetric grids, which not only reduces computational costs but also preserves essential geometric details by learning the joint distributions in a more tractable order.

Abstract

Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space. In this paper, we extend auto-regressive models to 3D domains, and seek a stronger ability of 3D shape generation by improving auto-regressive models at capacity and scalability simultaneously. Firstly, we leverage an ensemble of publicly available 3D datasets to facilitate the training of large-scale models. It consists of a comprehensive collection of approximately 900,000 objects, with multiple properties of meshes, points, voxels, rendered images, and text captions. This diverse labeled dataset, termed Objaverse-Mix, empowers our model to learn from a wide range of object variations. However, directly applying 3D auto-regression encounters critical challenges of high computational demands on volumetric grids and ambiguous auto-regressive order along grid dimensions, resulting in inferior quality of 3D shapes. To this end, we then present a novel framework Argus3D in terms of capacity. Concretely, our approach introduces discrete representation learning based on a latent vector instead of volumetric grids, which not only reduces computational costs but also preserves essential geometric details by learning the joint distributions in a more tractable order. The capacity of conditional generation can thus be realized by simply concatenating various conditioning inputs to the latent vector, such as point clouds, categories, images, and texts. In addition, thanks to the simplicity of our model architecture, we naturally scale up our approach to a larger model with an impressive 3.6 billion parameters, further enhancing the quality of versatile 3D generation. Extensive experiments on four generation tasks demonstrate that Argus3D can synthesize diverse and faithful shapes across multiple categories, achieving remarkable performance.
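The discrete representation and the concatenation-based conditioning described in the abstract can be illustrated with a small sketch. It is not the Argus3D implementation: the nearest-neighbour lookup is the generic vector-quantization step, and the function names `quantize` and `with_condition`, the toy codebook, and the condition token id are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent feature to the index of its nearest codebook entry.

    Generic vector quantization (squared Euclidean distance); the paper's
    actual quantizer may differ.
    """
    indices = []
    for z in latents:
        dists = [sum((a - b) ** 2 for a, b in zip(z, e)) for e in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def with_condition(cond_tokens: Sequence[int],
                   shape_tokens: Sequence[int]) -> List[int]:
    """Prepend condition tokens so an auto-regressive model sees them first.

    Assumption: conditions are already encoded as token ids.
    """
    return list(cond_tokens) + list(shape_tokens)


# Toy data: three 2-D codebook entries and a latent vector of three features.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
latents = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]]

shape_tokens = quantize(latents, codebook)
sequence = with_condition([7], shape_tokens)  # 7 is a made-up category token
print(shape_tokens, sequence)
```

Because the latent is a one-dimensional sequence, the token order is simply the vector's own index order, which is what makes the auto-regressive factorization tractable here.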

Paper Structure (18 sections, 6 equations, 27 figures, 12 tables)

Figures (27)

  • Figure 1: (a) We have combined five public 3D shape datasets, amassing a total of approximately 900,000 diverse shapes. (b) We manually filter out some noisy shapes, such as irregular shapes, complex scenes, non-watertight meshes and discrete shapes. (c) Our Objaverse-Mix dataset includes meshes, point clouds, occupancies, rendered images, and text captions, showcasing its multi-modal properties.
  • Figure 2: We propose an improved auto-regressive model to learn versatile 3D shape generation. Our approach can either generate diverse and faithful shapes with multiple categories via an unconditional way (one column to the left), or can be adapted for conditional generation by incorporating various conditioning inputs given on the left-top (three columns to the right).
  • Figure 3: Illustration of auto-regressive generation for grid-based representation. Here, we show three different flattening orders as examples. Best viewed in color.
  • Figure 4: Overview of our Argus3D. Given an arbitrary 3D shape, we first project encoded volumetric grids into the three axis-aligned planes, and then use a coupling network to further project them into a latent vector. Vector quantization is thus performed on it for discrete representation. Taking advantage of such a compact representation with tractable orders, vanilla transformers are adopted to auto-regressively learn shape distributions. Furthermore, we can freely switch from unconditional generation to conditional generation by concatenating various conditions, such as point clouds, categories and images.
  • Figure 5: Illustration of scaling up our models.
  • ...and 22 more figures
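
The order ambiguity mentioned in the Figure 3 caption can be made concrete with a short sketch. It enumerates the cells of a small 3D grid under two axis orderings and shows that the resulting sequences differ. The helper `flatten_order` is hypothetical, and the two orderings are illustrative; they are not necessarily the ones drawn in the figure.

```python
from itertools import product
from typing import List, Sequence, Tuple


def flatten_order(shape: Sequence[int],
                  axis_order: Sequence[int]) -> List[Tuple[int, ...]]:
    """List grid coordinates in scan order.

    axis_order[0] is the slowest-varying axis and axis_order[-1] the fastest.
    """
    ranges = [range(shape[a]) for a in axis_order]
    coords = []
    for idx in product(*ranges):
        cell = [0] * len(shape)
        for axis, i in zip(axis_order, idx):
            cell[axis] = i
        coords.append(tuple(cell))
    return coords


xyz = flatten_order((2, 2, 2), (0, 1, 2))  # last axis varies fastest
zyx = flatten_order((2, 2, 2), (2, 1, 0))  # first axis varies fastest

print(xyz[:3])
print(zyx[:3])
```

Both scans visit the same eight cells, but an auto-regressive model trained on one sequence conditions each cell on a different set of predecessors than a model trained on the other, and nothing in the grid itself singles out one order as canonical.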