LARM: A Large Articulated-Object Reconstruction Model

Sylvia Yuan, Ruoxi Shi, Xinyue Wei, Xiaoshuai Zhang, Hao Su, Minghua Liu

TL;DR

LARM addresses the challenge of reconstructing high-fidelity 3D articulated objects from sparse-view inputs by extending a transformer-based view synthesis framework to jointly model geometry, texture, and articulation. It encodes input views as patches that combine RGB values with Plücker ray embeddings and joint states, enabling novel-view synthesis conditioned on arbitrary camera poses and articulation configurations, and it predicts auxiliary signals (depth, masks) to support explicit mesh extraction. A two-stage training regimen (static-object pretraining on Objaverse, then articulation-focused finetuning on PartNet-Mobility), coupled with data augmentation, yields state-of-the-art performance in novel-view/state synthesis and 3D articulated reconstruction, including accurate joint-parameter estimation. The method also transfers to real-world iPhone captures and extends to multi-part objects. Its limitations are tied to data scale and to the diversity of kinematic structures seen during training, which suggests directions for future work on data generation and modeling.
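The input encoding mentioned above (RGB, Plücker ray embeddings, and joint states concatenated per pixel and split into patches) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the pinhole camera with OpenCV-style axes, the patch size, and the choice to broadcast a single scalar joint state to every pixel are all assumptions made for illustration.

```python
import numpy as np

def plucker_rays(K, cam_to_world, height, width):
    """Per-pixel Plücker coordinates (d, o x d) for a pinhole camera.

    K: 3x3 intrinsics; cam_to_world: 4x4 pose (assumed OpenCV-style axes).
    Returns an array of shape (height, width, 6).
    """
    # Pixel centers in image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)           # (H, W, 3)
    # Back-project to camera-frame directions, then rotate into the world frame.
    dirs = (pix @ np.linalg.inv(K).T) @ cam_to_world[:3, :3].T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    origin = np.broadcast_to(cam_to_world[:3, 3], dirs.shape)
    moment = np.cross(origin, dirs)                             # o x d
    return np.concatenate([dirs, moment], axis=-1)

def patchify(x, patch):
    """Split an (H, W, C) array into (num_patches, patch * patch * C) tokens."""
    h, w, c = x.shape
    x = x.reshape(h // patch, patch, w // patch, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

def input_tokens(rgb, K, cam_to_world, joint_state, patch=8):
    """Concatenate RGB, rays, and a scalar joint state per pixel, then patchify.

    Broadcasting one scalar joint state per pixel is an assumption of this sketch.
    """
    h, w, _ = rgb.shape
    rays = plucker_rays(K, cam_to_world, h, w)
    state = np.full((h, w, 1), joint_state, dtype=rgb.dtype)
    return patchify(np.concatenate([rgb, rays, state], axis=-1), patch)
```

Under these assumptions, a target view would be tokenized the same way but without the RGB channels, since only its rays and target joint state are known; in the actual model a learned linear layer would then project each flattened patch to the transformer width.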

Abstract

Modeling 3D articulated objects with realistic geometry, textures, and kinematics is essential for a wide range of applications. However, existing optimization-based reconstruction methods often require dense multi-view inputs and expensive per-instance optimization, limiting their scalability. Recent feedforward approaches offer faster alternatives but frequently produce coarse geometry, lack texture reconstruction, and rely on brittle, complex multi-stage pipelines. We introduce LARM, a unified feedforward framework that reconstructs 3D articulated objects from sparse-view images by jointly recovering detailed geometry, realistic textures, and accurate joint structures. LARM extends LVSM, a recent novel view synthesis (NVS) approach for static 3D objects, into the articulated setting by jointly reasoning over camera pose and articulation variation using a transformer-based architecture, enabling scalable and accurate novel view synthesis. In addition, LARM generates auxiliary outputs such as depth maps and part masks to facilitate explicit 3D mesh extraction and joint estimation. Our pipeline eliminates the need for dense supervision and supports high-fidelity reconstruction across diverse object categories. Extensive experiments demonstrate that LARM outperforms state-of-the-art methods in both novel view and state synthesis as well as 3D articulated object reconstruction, generating high-quality meshes that closely adhere to the input images. Project page: https://sylviayuan-sy.github.io/larm-site/
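To make concrete how predicted depth maps and part masks can support explicit 3D extraction, here is a minimal sketch of lifting one depth map to a world-space point cloud and keeping only the pixels selected by a mask. The function name, the pinhole model, and the convention that depth stores z-distance along the camera axis (with non-positive values meaning "no surface") are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world, mask=None):
    """Lift a z-depth map (H, W) to world-space points of shape (N, 3).

    K: 3x3 pinhole intrinsics; cam_to_world: 4x4 camera pose.
    mask: optional boolean (H, W) array, e.g. a predicted part mask.
    """
    h, w = depth.shape
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)
    # Scale each back-projected pixel ray by its depth to get camera-frame points.
    pts_cam = (pix @ np.linalg.inv(K).T) * depth[..., None]
    # Rigidly move the points into the world frame.
    pts_world = pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    valid = depth > 0
    if mask is not None:
        valid &= mask
    return pts_world[valid]
```

Calling this once with a movable-part mask and once with its complement within the foreground, over several synthesized views, would give two point clouds that a surface-reconstruction tool could then turn into meshes.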

Paper Structure

This paper contains 23 sections, 4 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: LARM Architecture. LARM first patchifies the sparse, posed input images into tokens by concatenating the input RGB values, Plücker ray embeddings, and corresponding joint states. The target view to be synthesized is similarly represented by its Plücker ray embeddings and a target joint state, which are concatenated and tokenized. These input and target tokens are then fed into a decoder-only transformer model that predicts tokens used to regress the target view pixels. To enable explicit 3D reconstruction, LARM is also trained to produce additional outputs beyond RGB values, such as depth maps, foreground masks, and part masks.
  • Figure 2: Joint Estimation. To estimate explicit joint parameters using the LARM model, we first synthesize numerous image pairs with similar camera poses but different joint states. Next, we establish 2D pixel-wise correspondences for the movable part, which are then lifted to 3D. The joint parameters are optimized by minimizing the distances between the corresponding 3D point pairs under the estimated transformations. To enhance robustness, this optimization is integrated with RANSAC.
  • Figure 3: Mesh Reconstruction. To extract an explicit mesh using the LARM model, we first synthesize multiple views from different camera poses. These views are lifted to 3D using the predicted depth and segmented based on the predicted masks, forming two separate colored point clouds. These point clouds are then fed into off-the-shelf point-cloud-to-mesh tools [peng2021shape] for explicit mesh reconstruction.
  • Figure 4: Real-world Demo. We use an iPhone to capture sparse-view images of everyday articulated objects. The results demonstrate that LARM can effectively handle such inputs and predict accurate novel views across diverse camera poses and joint states.
  • Figure 5: Comparison of 3D Articulated Object Reconstruction. Note that methods such as URDFormer [chen2024urdformer] and Articulate-Anything [le2024articulate] rely on part retrieval for reconstruction, often resulting in significant mismatches in geometry and texture compared to the input prompt. In contrast, our LARM model faithfully reconstructs high-quality textured meshes that closely align with the input prompts.
  • ...and 3 more figures
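
The joint-estimation idea in the Figure 2 caption (fit a transformation to corresponding 3D point pairs of the movable part, with RANSAC for robustness) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: it swaps in the closed-form Kabsch solution for the rigid fit instead of an iterative optimization, assumes a revolute joint with a rotation angle strictly between 0 and π, recovers only the axis direction and angle (not the pivot point), and uses hypothetical function names and thresholds.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ≈ src @ R.T + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, dst_c - R @ src_c

def ransac_rigid(src, dst, iters=200, thresh=0.01, seed=0):
    """Robust rigid fit: sample 3 pairs, keep the largest inlier set, refit on it."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        R, t = kabsch(src[idx], dst[idx])
        inliers = np.linalg.norm(src @ R.T + t - dst, axis=1) < thresh
        if inliers.sum() > best.sum():
            best = inliers
    R, t = kabsch(src[best], dst[best])
    return R, t, best

def revolute_axis_angle(R):
    """Unit rotation axis and angle (radians); assumes 0 < angle < pi."""
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return axis / np.linalg.norm(axis), angle
```

Here `src` and `dst` stand for the lifted 3D positions of the same movable-part pixels in two joint states. For a prismatic joint the fitted rotation would be near identity and the translation `t` would carry the joint direction, which this sketch does not handle separately.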