PAct: Part-Decomposed Single-View Articulated Object Generation

Qingming Liu, Xinyue Yao, Shuyuan Zhang, Yueci Deng, Guiliang Liu, Zhen Liu, Kui Jia

TL;DR

This work introduces a part-centric generative framework for articulated object creation that synthesizes part geometry, composition, and articulation under explicit part-aware conditioning. It enables fast feed-forward inference and supports controllable assembly and articulation, both of which are important for embodied interaction.

Abstract

Articulated objects are central to interactive 3D applications, including embodied AI, robotics, and VR/AR, where functional part decomposition and kinematic motion are essential. Yet producing high-fidelity articulated assets remains difficult to scale because it requires reliable part decomposition and kinematic rigging. Existing approaches largely fall into two paradigms: optimization-based reconstruction or distillation, which can be accurate but often takes tens of minutes to hours per instance, and inference-time methods that rely on template or part retrieval, producing plausible results that may not match the specific structure and appearance in the input observation. We introduce a part-centric generative framework for articulated object creation that synthesizes part geometry, composition, and articulation under explicit part-aware conditioning. Our representation models an object as a set of movable parts, each encoded by latent tokens augmented with part identity and articulation cues. Conditioned on a single image, the model generates articulated 3D assets that preserve instance-level correspondence while maintaining valid part structure and motion. The resulting approach avoids per-instance optimization, enables fast feed-forward inference, and supports controllable assembly and articulation, which are important for embodied interaction. Experiments on common articulated categories (e.g., drawers and doors) show improved input consistency, part accuracy, and articulation plausibility over optimization-based and retrieval-driven baselines, while substantially reducing inference time.

Paper Structure (38 sections, 11 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Pipeline of PAct. Stage 1 predicts a part-decomposed sparse structure from a single image using a Part-Aware Flow Model. Stage 2 refines it into detailed 3D part representations via sparse-transformer denoising, while an articulation module aggregates multi-step features to estimate joint parameters for each part. The predicted joint parameters together with the reconstructed part geometries form the final articulated object.
  • Figure 2: Architecture of the Stage 1 model. To accommodate multi-part inputs, a subset of attention layers in the original TRELLIS network is repurposed to perform within-part local attention. With this modification, the model is fine-tuned on the target dataset.
  • Figure 3: Illustration of the Stage 2 pipeline. We cache a subset of intermediate features during part-token denoising. For each part, these features are averaged over selected inference timesteps and then pooled over the token dimension. A learned MLP then predicts the articulation parameters.
  • Figure 4: Qualitative comparison on the PartNet-Mobility [xiang2020sapien] and ACD [iliash2024s2o] datasets. We compare PAct with SINGAPO [liu2024singapo] and Articulate-Anything [le2024articulate].
  • Figure 5: Qualitative comparison with FreeArt3D [chen2025freeart3d]. We visualize the geometry of different parts using distinct colors.
  • ...and 5 more figures
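
The captions for Figures 2 and 3 describe two mechanisms that a small NumPy sketch may make more concrete: attention restricted to tokens of the same part, and per-part feature aggregation followed by an MLP head. The sketch below is illustrative only and is not the authors' implementation. All function names, array shapes, the use of mean pooling over tokens, the two-layer ReLU MLP, and the 7-dimensional output are assumptions made for the example; the captions only state that features are averaged over selected timesteps, pooled over the token dimension, and passed to a learned MLP.

```python
import numpy as np


def within_part_mask(part_ids):
    """Return an (N, N) boolean mask that is True where two tokens share a part.

    Hypothetical helper: one way to express "within-part local attention".
    """
    part_ids = np.asarray(part_ids)
    return part_ids[:, None] == part_ids[None, :]


def pool_part_features(cached, part_ids, steps):
    """Aggregate cached denoising features into one vector per part.

    cached:   (T, N, D) features cached at T timesteps for N tokens.
    part_ids: length-N part index for each token.
    steps:    indices of the timesteps to average over.
    Mean pooling over tokens is an assumption; the caption only says "pooled".
    """
    cached = np.asarray(cached, dtype=float)
    part_ids = np.asarray(part_ids)
    step_mean = cached[list(steps)].mean(axis=0)  # (N, D): average over timesteps
    return {
        int(p): step_mean[part_ids == p].mean(axis=0)  # (D,): pool over tokens
        for p in np.unique(part_ids)
    }


def mlp_head(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU; the weights here are placeholders, not trained."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    part_ids = [0, 0, 1, 1, 1]            # five tokens split across two parts
    cached = rng.normal(size=(4, 5, 8))   # 4 timesteps, 5 tokens, 8 channels
    mask = within_part_mask(part_ids)     # block-diagonal attention mask
    pooled = pool_part_features(cached, part_ids, steps=[1, 3])
    w1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
    w2, b2 = rng.normal(size=(16, 7)), np.zeros(7)  # 7 outputs: illustrative only
    joint_params = {p: mlp_head(v, w1, b1, w2, b2) for p, v in pooled.items()}
    print(mask.astype(int))
    print({p: v.shape for p, v in joint_params.items()})
```

In an attention layer, such a mask would typically be applied by setting the masked-out logits to a large negative value before the softmax, so that each part's tokens attend only to one another.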