SM$^3$: Self-Supervised Multi-task Modeling with Multi-view 2D Images for Articulated Objects
Haowen Wang, Zhen Zhao, Zhao Jin, Zhengping Che, Liang Qiao, Yakun Huang, Zhipeng Fan, Xiuquan Qiao, Jian Tang
TL;DR
The paper addresses the modeling of articulated objects without annotations by introducing SM$^3$, a self-supervised, multi-task framework that jointly reconstructs textured 3D geometry, segments movable parts, and estimates the parameters of rotating joints from multi-view RGB images captured before and after interaction. It builds on a deformable tetrahedral grid inspired by Nvdiffrec and uses two geometric workflows and a patch-based loss to guide the integrated optimization, removing the need for labeled 3D models. To support evaluation and training for articulated objects, the authors present MMArt, a multi-view, multi-modal extension of PartNet-Mobility built in NVIDIA Isaac Sim. Empirical results show that SM$^3$ outperforms strong baselines across multiple object categories and also works in real-world scenarios, indicating its potential for scalable, annotation-free articulation modeling in robotics.
Abstract
Reconstructing real-world objects and estimating their movable joint structures are pivotal technologies in robotics. Previous research has predominantly focused on supervised approaches that rely on extensively annotated datasets to model articulated objects within a limited set of categories. Such approaches fall short of addressing the diversity present in the real world. To tackle this issue, we propose a self-supervised interaction perception method, referred to as SM$^3$, which leverages multi-view RGB images captured before and after interaction to model articulated objects, identify their movable parts, and infer the parameters of their rotating joints. By constructing 3D geometries and textures from the captured 2D images, SM$^3$ optimizes the movable parts and joint parameters jointly during the reconstruction process, obviating the need for annotations. Furthermore, we introduce the MMArt dataset, an extension of PartNet-Mobility that contains multi-view and multi-modal data of articulated objects spanning diverse categories. Evaluations demonstrate that SM$^3$ outperforms existing methods across various categories and objects, and its adaptability to real-world scenarios has also been validated.
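To make "the parameters of a rotating joint" concrete, the following is a minimal sketch, not the optimization used by SM$^3$. It assumes the rigid motion of the movable part between the pre- and post-interaction states is already known as a rotation `R` and translation `t` (so that a point `x` moves to `R @ x + t`), and it recovers the joint axis direction, a point on the axis, and the rotation angle in closed form. The function name `revolute_joint_from_transform` and the door example are hypothetical and chosen only for illustration.

```python
import numpy as np


def revolute_joint_from_transform(R, t):
    """Recover (axis, pivot, angle) of a revolute joint from x' = R @ x + t.

    Illustrative assumption: the motion is a pure rotation about a fixed
    axis with an angle strictly between 0 and pi.
    """
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float)

    # The rotation angle follows from the trace: tr(R) = 1 + 2 cos(angle).
    cos_angle = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    angle = float(np.arccos(cos_angle))

    # The skew-symmetric part of R equals 2 sin(angle) times the axis.
    axis = np.array([
        R[2, 1] - R[1, 2],
        R[0, 2] - R[2, 0],
        R[1, 0] - R[0, 1],
    ])
    norm = np.linalg.norm(axis)
    if norm < 1e-8:
        raise ValueError("angle is close to 0 or pi; axis is not recoverable here")
    axis = axis / norm

    # Points p on the axis are fixed by the motion: (I - R) p = t.
    # I - R has rank 2, so the minimum-norm least-squares solution is
    # the point on the axis closest to the origin.
    pivot, *_ = np.linalg.lstsq(np.eye(3) - R, t, rcond=1e-8)
    return axis, pivot, angle


# Hypothetical example: a door rotating 30 degrees about a vertical hinge
# that passes through the point (1, 2, 0).
theta = np.pi / 6
c, s = np.cos(theta), np.sin(theta)
R_door = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
hinge_point = np.array([1.0, 2.0, 0.0])
t_door = hinge_point - R_door @ hinge_point

axis, pivot, angle = revolute_joint_from_transform(R_door, t_door)
print("axis:", axis, "pivot:", pivot, "angle (deg):", np.degrees(angle))
```

In a self-supervised setting such as the one described above, the transform itself is unknown and has to be estimated together with the part segmentation, so this closed-form step only illustrates what the joint parameters represent.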
