MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Haitian Li; Haozhe Xie; Junxiang Xu; Beichen Wen; Fangzhou Hong; Ziwei Liu

Abstract

Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.

Paper Structure (54 sections, 24 equations, 10 figures, 4 tables)

Introduction
Related Work
Articulated Object Modeling
3D Part Segmentation
Our Approach
Overview
TRELLIS-based 3D Generator
Part-Aware Semantic Reasoner
Tri-linear Interpolation
Triplane Projection
Part Contrast Transformer
Dual-Query Motion Decoder
Dual-query Initialization
Refinement Block
Query Confidence Estimation
...and 39 more sections

Figures (10)

Figure 1: (Left) Qualitative results of SINGAPO [DBLP:conf/iclr/LiuICSA25], Articulate-Anything (ArtAny) [DBLP:conf/iclr/LeXLWYMVKJE25], PhysX-Anything (PhysXAny) [DBLP:journals/corr/abs-2511-13648], and MonoArt on diverse objects. (Right) F-score vs. inference time on the PartNet-Mobility [DBLP:conf/cvpr/XiangQMXZLLJYWY20] test set. Circles indicate models evaluated on 7 categories, while triangles denote models supporting all 46 categories.
Figure 2: Overview of MonoArt. TRELLIS-based 3D Generator reconstructs a canonical shape from a single image. Part-Aware Semantic Reasoner derives tri-plane-based part embeddings. Dual-Query Motion Decoder performs iterative motion reasoning, and Kinematic Estimator predicts part-level articulation parameters (motion type, origin, axis, limits) and infers the kinematic tree structure. Note that "Attn.", "Interp.", "Proj.", "Cont.", "Trans.", and "Init." represent "Attention", "Interpolation", "Projection", "Contrast", "Transformer", and "Initialization", respectively. $\oplus$ and $\otimes$ denote element-wise addition and matrix multiplication, respectively.
Figure 3: Qualitative results on the test set of PartNet-Mobility. ArtAny and PhysXAny denote Articulate-Anything and PhysX-Anything, respectively. For each object, we show the reconstructed geometry under two sampled articulated states.
Figure 4: Qualitative results on in-the-wild images. ArtAny and PhysXAny denote Articulate-Anything and PhysX-Anything, respectively. For each object, we show the reconstructed geometry under two sampled articulated states.
Figure 5: Robot manipulation with generated articulated objects. MonoArt reconstructions are directly imported into IsaacSim for contact-rich interaction.
...and 5 more figures
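
The Figure 2 caption lists the part-level articulation parameters (motion type, origin, axis, limits) and a kinematic tree, and Figures 3 and 4 show geometry under two sampled articulated states. As an illustration only, the sketch below shows one way such parameters could be stored and used to move a point of a part to a sampled state. It is not the paper's implementation: the class `PartArticulation`, the function `articulate_point`, the motion-type labels, and the interpretation of the limits (radians for revolute joints, length units for prismatic ones) are all assumptions made for this example. The `parent` field records the tree structure, but the sketch applies only a part's own joint and does not chain transforms up the tree.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class PartArticulation:
    """Hypothetical container for the per-part parameters named in Figure 2."""
    name: str
    motion_type: str              # assumed labels: "fixed", "revolute", "prismatic"
    origin: Vec3                  # a point on the joint axis
    axis: Vec3                    # joint axis direction (normalized before use)
    limits: Tuple[float, float]   # assumed: radians (revolute) or length (prismatic)
    parent: Optional[int] = None  # index of the parent part in the kinematic tree


def _normalize(v: Vec3) -> Vec3:
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def articulate_point(p: Vec3, part: PartArticulation, t: float) -> Vec3:
    """Move point p of `part` to the state at fraction t in [0, 1] of its limits."""
    lo, hi = part.limits
    q = lo + t * (hi - lo)  # joint value at the sampled state
    k = _normalize(part.axis)
    if part.motion_type == "prismatic":
        # Translate along the axis by q.
        return tuple(pc + q * kc for pc, kc in zip(p, k))
    if part.motion_type == "revolute":
        # Rodrigues' rotation formula about the axis passing through `origin`.
        v = tuple(pc - oc for pc, oc in zip(p, part.origin))
        cross = (k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0])
        dot = sum(kc * vc for kc, vc in zip(k, v))
        c, s = math.cos(q), math.sin(q)
        r = tuple(vc * c + xc * s + kc * dot * (1.0 - c)
                  for vc, xc, kc in zip(v, cross, k))
        return tuple(rc + oc for rc, oc in zip(r, part.origin))
    return tuple(p)  # fixed parts do not move


# Two sampled states (closed and fully open) for a hypothetical door part.
door = PartArticulation("door", "revolute", (0.0, 0.0, 0.0), (0.0, 0.0, 1.0),
                        (0.0, math.pi / 2), parent=0)
closed_state = articulate_point((1.0, 0.0, 0.0), door, 0.0)
open_state = articulate_point((1.0, 0.0, 0.0), door, 1.0)
```

Storing the joint as an origin, an axis, and a limit range keeps each part's motion independent of the mesh, so any number of articulated states can be sampled from a single reconstruction by varying `t`.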
