ArtFormer: Controllable Generation of Diverse 3D Articulated Objects

Jiayi Su, Youhe Feng, Zheng Li, Jinhua Song, Yangfan He, Botao Ren, Botian Xu

TL;DR

ArtFormer introduces a tree-structured articulation parameterization and a diffusion-based SDF shape prior to jointly generate high-quality geometry and kinematic relations for 3D articulated objects. A dedicated Articulation Transformer with tree-position embeddings and cross-attention enables conditional, autoregressive decoding of parts, while the shape prior ensures diverse yet plausible geometry. Experiments on text- and image-conditioned generation demonstrate strong geometry fidelity, accurate joint relations, and enhanced diversity compared with baselines, with additional support from ablations and human studies. The approach supports novel shape generation and editing, offering a flexible framework for scalable, controllable articulated object synthesis with potential applications in robotics and digital twins.

Abstract

This paper presents a novel framework for modeling and conditional generation of 3D articulated objects. Troubled by flexibility-quality tradeoffs, existing methods are often limited to using predefined structures or retrieving shapes from static datasets. To address these challenges, we parameterize an articulated object as a tree of tokens and employ a transformer to generate both the object's high-level geometry code and its kinematic relations. Subsequently, each sub-part's geometry is further decoded using a signed-distance-function (SDF) shape prior, facilitating the synthesis of high-quality 3D shapes. Our approach enables the generation of diverse objects with high-quality geometry and varying numbers of parts. Comprehensive experiments on conditional generation from text descriptions demonstrate the effectiveness and flexibility of our method.
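As a rough illustration of the "tree of tokens" idea, the Python sketch below stores one sub-part per node and flattens the tree into a sequence in which every token remembers its parent. The four per-node fields follow the attributes named in the Figure 3 caption (bounding box, joint axis, limit, geometry latent code). The class name, field names, tuple layouts, and the breadth-first ordering are assumptions made for this example, not details taken from the paper.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PartNode:
    """One sub-part of an articulated object (hypothetical layout)."""
    bbox: Tuple[float, ...]        # b_i: bounding box, assumed (min_xyz, max_xyz)
    joint_axis: Tuple[float, ...]  # j_i: joint axis, assumed a 3D direction
    limit: Tuple[float, float]     # l_i: joint limit, assumed (lower, upper)
    latent: List[float]            # z_i: geometry latent code
    children: List["PartNode"] = field(default_factory=list)


def flatten(root: PartNode) -> List[Tuple[int, PartNode]]:
    """Return (parent_index, node) pairs in breadth-first order.

    The root gets parent index -1. The ordering is an assumption of this
    sketch; any order that lets each token locate its parent would do.
    """
    tokens: List[Tuple[int, PartNode]] = []
    queue = deque([(-1, root)])
    while queue:
        parent_idx, node = queue.popleft()
        my_idx = len(tokens)
        tokens.append((parent_idx, node))
        for child in node.children:
            queue.append((my_idx, child))
    return tokens
```

Storing the parent index next to each token keeps the kinematic structure recoverable from a flat sequence, which is the property a sequence model needs in order to consume a tree.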

Paper Structure

This paper contains 26 sections, 13 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: We present the Articulation TransFormer for high-quality generation of articulated objects. This figure illustrates controlled generation across random trials based on text descriptions. Notably, it can generate a diverse range of objects with varying numbers of sub-parts and different geometric features.
  • Figure 2: Training Pipeline of Shape Prior: The mini encoder $\mathcal{E}_g$ compresses the geometry latent code $z$ into $c_g$, which is then processed by the embedding vectors of the codebooks to form $\hat{c}_g$. $\hat{c}_g$ is the condition for the diffusion decoder $\epsilon$. Each sub-part has a semantic label, such as 'the lid of cup' or 'handle of box'. These labels, encoded by the pre-trained text encoder, pass through the mini encoder $\mathcal{E}_s$. The resultant vector $c_s$ is then passed directly into the diffusion shape prior.
  • Figure 3: Articulation Transformer: In the tree structure, the $i$-th node carries $4$ attributes: $b_i$, $j_i$, $l_i$ and $z_i$, which respectively represent the bounding box, joint axis, limit, and geometry latent code. $\hat{o}$ represents the logits indicating whether the current output token is a terminal token $\mathcal{T}$ (a special token).
  • Figure 4: Each blue card represents a round in the predicting process. On each blue card, the left side shows the input given to the model and the expected output. The right side displays the tree structure of the articulated object formed after this prediction round, with green nodes indicating the nodes generated in this round. Orange nodes are terminal nodes.
  • Figure 5: Qualitative comparison between ArtFormer and baselines (Ours-1CB will be discussed in the ablation section). Our method is capable of generating high-quality geometry and accurate joint relations.
  • ...and 11 more figures
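
The round-by-round prediction described in the Figure 4 caption can be mimicked with a toy loop. The sketch below is only one possible reading of that caption and is not the paper's decoding procedure. It assumes that in each round every still-open node is asked for one more child, and that a terminal token closes the node. The function `decode_rounds`, the `predict` callback signature, the string labels, and the `TERMINAL` marker are all hypothetical stand-ins; the actual model predicts continuous attributes with a transformer rather than labels from a script.

```python
from typing import Callable, List, Tuple

TERMINAL = "<T>"  # hypothetical stand-in for the terminal token


def decode_rounds(
    predict: Callable[[List[str], List[int], int], str],
    max_rounds: int = 16,
) -> Tuple[List[str], List[int], int]:
    """Grow a tree one round at a time (toy sketch).

    predict(labels, parents, i) returns a label for the next child of
    node i, or TERMINAL when node i should receive no more children.
    Returns (labels, parents, rounds_used).
    """
    labels, parents = ["root"], [-1]
    open_nodes = [0]
    rounds = 0
    while open_nodes and rounds < max_rounds:
        rounds += 1
        # Snapshot the tree so every prediction in a round sees the same input.
        seen_labels, seen_parents = list(labels), list(parents)
        still_open: List[int] = []
        for i in open_nodes:
            out = predict(seen_labels, seen_parents, i)
            if out == TERMINAL:
                continue  # node i is closed from now on
            labels.append(out)
            parents.append(i)
            still_open.append(i)                # i may get further children
            still_open.append(len(labels) - 1)  # the new node may get children
        open_nodes = still_open
    return labels, parents, rounds


def scripted_predictor(plan):
    """Build a predict callback that replays a fixed child list per label."""
    def predict(labels, parents, i):
        kids = plan.get(labels[i], [])
        already = sum(1 for p in parents if p == i)
        return kids[already] if already < len(kids) else TERMINAL
    return predict
```

For example, `decode_rounds(scripted_predictor({"root": ["door", "drawer"], "door": ["handle"]}))` grows a four-node tree and stops once every node has emitted the terminal marker. Swapping the scripted callback for a learned model is where the real method would differ.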