PASTA: Controllable Part-Aware Shape Generation with Autoregressive Transformers

Songlin Li, Despoina Paschalidou, Leonidas Guibas

TL;DR

PASTA tackles controllable 3D shape generation by modeling shapes as unordered sets of cuboidal parts and then blending them into high-fidelity meshes with an occupancy-based network. The two-stage approach (autoregressive object generation followed by a blending network) enables generation from scratch, shape completion, and multi-modal conditional generation (bounding boxes, text, and images), with scheduled sampling improving results. Across PartNet chairs, tables, and lamps, PASTA outperforms prior part-based and non-part-based methods in realism and diversity while remaining simple to train. The work demonstrates practical capabilities such as language- and image-guided generation and size-controlled synthesis, and discusses limitations and societal considerations.

Abstract

The increased demand for tools that automate the 3D content creation process led to tremendous progress in deep generative models that can generate diverse 3D objects of high fidelity. In this paper, we present PASTA, an autoregressive transformer architecture for generating high quality 3D shapes. PASTA comprises two main components: an autoregressive transformer that generates objects as a sequence of cuboidal primitives, and a blending network, implemented with a transformer decoder, that composes the sequences of cuboids and synthesizes high quality meshes for each object. Our model is trained in two stages: first, we train our autoregressive generative model using only annotated cuboidal parts as supervision, and next, we train our blending network using explicit 3D supervision in the form of watertight meshes. Evaluations on various ShapeNet objects showcase the ability of our model to perform shape generation from diverse inputs, e.g., from scratch, from a partial object, from text and images, as well as size-guided generation, by explicitly conditioning on a bounding box that defines the object's boundaries. Moreover, as our model considers the underlying part-based structure of a 3D object, we are able to select a specific part and produce shapes with meaningful variations of this part. As evidenced by our experiments, our model generates 3D shapes that are both more realistic and diverse than existing part-based and non-part-based methods, while at the same time being simpler to implement and train.

Paper Structure

This paper contains 31 sections, 10 equations, 27 figures, and 4 tables.

Figures (27)

  • Figure 1: Controllable Part-Aware 3D Shape Generation. We propose a novel autoregressive architecture that can be used to perform several editing tasks, such as generating novel shapes from scratch conditioned on a bounding box defining the object's boundaries, completing a 3D shape from a partial input, generating shapes from text, an image, or bounding boxes of different sizes, as well as generating plausible variations for specific parts of the object.
  • Figure 2: Object Generator. Given a sequence of $N$ parts and a bounding box $\mathbf{B}$ defining the object boundaries, the part encoder $s_\theta(\cdot)$ maps each part $p_j$ and the bounding box to an embedding vector. The bounding box's embedding vector $\mathbf{z}_B$, the per-part embeddings $\{\mathbf{z}_j\}_{j=1}^N$ and a learnable embedding vector $\mathbf{q}$ are passed to the transformer decoder, which predicts a feature vector $\mathbf{F}$ used to predict the attributes of the next part in the sequence. The part decoder takes $\mathbf{F}$ and autoregressively predicts the attribute distributions that are used to sample the attributes for the next part.
  • Figure 3: Blending Network. Given a sequence of $N$ parts, the part encoder maps them into embedding vectors $\{\mathbf{z}_j\}_{j=1}^N$. We pass the per-part embedding vectors and a set of 3D query points $\mathcal{X}$ to the transformer decoder, which predicts the occupancy probabilities for the query points.
  • Figure 4: Scheduled Sampling. Given an object with $N$ parts, we first randomly permute them and keep the first $M$ parts (here $M=3$). We pass them to the object generator, which predicts the next part to be generated (red cube). The newly generated cuboid is appended to the initial sequence of $M$ parts and passed once again to the object generator, which predicts the next part to be generated. Our loss function from \ref{eq:loss} is computed between the new part and the $(M+2)$-th part in the permuted sequence (yellow cube).
  • Figure 5: Shape Generation Results on Chairs. We show randomly generated chairs using our model, ATISS, PQ-NET and IM-NET.
  • ...and 22 more figures
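
The scheduled sampling procedure described in the Figure 4 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `scheduled_sampling_step`, `predict_next`, `loss_fn`, and the tuple-based `Part` type are hypothetical names, and drawing $M$ uniformly from $1$ to $N-2$ is an assumption made only so that an $(M+2)$-th part always exists.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical part representation: a flat tuple of cuboid attributes.
Part = Tuple[float, ...]


def scheduled_sampling_step(
    parts: Sequence[Part],
    predict_next: Callable[[List[Part]], Part],
    loss_fn: Callable[[Part, Part], float],
    rng: random.Random,
) -> float:
    """One scheduled-sampling step, following the Figure 4 caption."""
    n = len(parts)
    if n < 3:
        raise ValueError("need at least 3 parts so an (M+2)-th part exists")

    # 1. Randomly permute the parts of the object.
    permuted = list(parts)
    rng.shuffle(permuted)

    # 2. Keep the first M parts (assumption: M is uniform in [1, N-2]).
    m = rng.randint(1, n - 2)
    context = permuted[:m]

    # 3. The generator predicts the next part from the M ground-truth parts.
    generated = predict_next(context)

    # 4. Append the generated part and query the generator once more.
    new_part = predict_next(context + [generated])

    # 5. Compare the new part with the (M+2)-th part of the permuted
    #    sequence, which sits at 0-based index M+1.
    target = permuted[m + 1]
    return loss_fn(new_part, target)
```

In a real training loop, `predict_next` would stand in for sampling from the object generator and `loss_fn` for the paper's loss; here both are plain callables so the control flow can be read in isolation. The point of the second prediction is that the generator is conditioned on one of its own outputs rather than only on ground-truth parts.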