LARM: A Large Articulated-Object Reconstruction Model
Sylvia Yuan, Ruoxi Shi, Xinyue Wei, Xiaoshuai Zhang, Hao Su, Minghua Liu
TL;DR
LARM reconstructs high-fidelity 3D articulated objects from sparse-view inputs by extending a transformer-based view synthesis framework to jointly model geometry, texture, and articulation. It encodes input views as patches augmented with Plücker ray embeddings and joint states, which enables novel-view synthesis conditioned on arbitrary camera poses and articulation configurations; auxiliary outputs (depth, masks) support explicit mesh extraction. A two-stage training regimen (static-object pretraining on Objaverse, then articulation-focused finetuning on PartNet-Mobility), combined with data augmentation, yields state-of-the-art performance in novel-view/state synthesis and 3D articulated reconstruction, including accurate joint-parameter estimation. The method transfers across domains, including real-world iPhone captures, and extends to multi-part objects. Its limitations are linked to data scale and kinematic diversity, which points to future work on data generation and modeling.
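To make the patch-based encoding above concrete, here is a minimal NumPy sketch of per-pixel Plücker ray coordinates for a pinhole camera, followed by a simple patchify step. It is an illustration under assumptions, not the authors' implementation: the function names, the (direction, moment) ordering, the 8-pixel patch size, and the choice to append the joint state as one broadcast scalar channel are all hypothetical.

```python
import numpy as np

def plucker_ray_embedding(K, cam_to_world, height, width):
    """Return an (H, W, 6) array of per-pixel Plücker coordinates (d, o x d)."""
    # Pixel centers in image coordinates (u along columns, v along rows).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)
    # Back-project to camera-space directions, then rotate into world space.
    dirs_cam = pixels @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    # The moment o x d makes the encoding independent of the point chosen on the ray.
    origin = np.broadcast_to(cam_to_world[:3, 3], dirs_world.shape)
    moment = np.cross(origin, dirs_world)
    return np.concatenate([dirs_world, moment], axis=-1)

def patchify(features, patch):
    """Split an (H, W, C) map into non-overlapping patches, one flat token each."""
    h, w, c = features.shape
    assert h % patch == 0 and w % patch == 0
    x = features.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * c)

# Hypothetical usage: one 64x64 view with a made-up camera and joint state.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [0.0, 0.0, -2.0]                 # camera placed 2 units behind the origin
rays = plucker_ray_embedding(K, pose, 64, 64)  # (64, 64, 6)
image = np.zeros((64, 64, 3))                  # placeholder RGB
joint = np.full((64, 64, 1), 0.5)              # assumed: joint state as a scalar channel
tokens = patchify(np.concatenate([image, rays, joint], axis=-1), patch=8)
```

Each token here concatenates RGB, ray, and joint-state channels for one 8x8 patch; a transformer would typically pass these through a linear projection before attention. Target views would carry only the ray and joint-state channels, since their RGB values are what the model predicts.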
Abstract
Modeling 3D articulated objects with realistic geometry, textures, and kinematics is essential for a wide range of applications. However, existing optimization-based reconstruction methods often require dense multi-view inputs and expensive per-instance optimization, limiting their scalability. Recent feedforward approaches offer faster alternatives but frequently produce coarse geometry, lack texture reconstruction, and rely on brittle, complex multi-stage pipelines. We introduce LARM, a unified feedforward framework that reconstructs 3D articulated objects from sparse-view images by jointly recovering detailed geometry, realistic textures, and accurate joint structures. LARM extends LVSM, a recent novel view synthesis (NVS) approach for static 3D objects, to the articulated setting by jointly reasoning over camera pose and articulation variation with a transformer-based architecture, enabling scalable and accurate novel view synthesis. In addition, LARM generates auxiliary outputs such as depth maps and part masks to facilitate explicit 3D mesh extraction and joint estimation. Our pipeline eliminates the need for dense supervision and supports high-fidelity reconstruction across diverse object categories. Extensive experiments demonstrate that LARM outperforms state-of-the-art methods in novel view and state synthesis as well as in 3D articulated object reconstruction, generating high-quality meshes that closely adhere to the input images. Project page: https://sylviayuan-sy.github.io/larm-site/
