SM$^3$: Self-Supervised Multi-task Modeling with Multi-view 2D Images for Articulated Objects

Haowen Wang; Zhen Zhao; Zhao Jin; Zhengping Che; Liang Qiao; Yakun Huang; Zhipeng Fan; Xiuquan Qiao; Jian Tang

TL;DR

The paper addresses annotation-free modeling of articulated objects with SM$^3$, a self-supervised, multi-task framework that jointly reconstructs textured 3D geometry, segments movable parts, and estimates rotating joints from multi-view RGB images captured before and after interaction. It builds on a deformable tetrahedral grid inspired by nvdiffrec and adds two geometric workflows (a movable part segmentation prior and candidate joint prediction) plus a patch-based loss to guide the integrated optimization, so no labeled 3D models are required. To support evaluation, the authors present MMArt, a multi-view, multi-modal extension of PartNet-Mobility built in NVIDIA Isaac Sim. The reported results show SM$^3$ outperforming existing baselines across multiple object categories, and the method is also demonstrated on real-world articulated objects.

Abstract

Reconstructing real-world objects and estimating their movable joint structures are pivotal technologies within the field of robotics. Previous research has predominantly focused on supervised approaches, relying on extensively annotated datasets to model articulated objects within limited categories. However, this approach falls short of effectively addressing the diversity present in the real world. To tackle this issue, we propose a self-supervised interaction perception method, referred to as SM$^3$, which leverages multi-view RGB images captured before and after interaction to model articulated objects, identify the movable parts, and infer the parameters of their rotating joints. By constructing 3D geometries and textures from the captured 2D images, SM$^3$ achieves integrated optimization of movable part and joint parameters during the reconstruction process, obviating the need for annotations. Furthermore, we introduce the MMArt dataset, an extension of PartNet-Mobility, encompassing multi-view and multi-modal data of articulated objects spanning diverse categories. Evaluations demonstrate that SM$^3$ surpasses existing benchmarks across various categories and objects, while its adaptability in real-world scenarios has been thoroughly validated.

Paper Structure (26 sections, 10 equations, 5 figures, 3 tables)

Introduction
Related Work
3D Reconstruction for Multi-View Images
Motion Structure Estimation
Datasets for Articulated Objects
Methodology
Overview
Reconstruction Based on Tetrahedron
Movable Part Segmentation Prior
Candidate Joints Prediction
Integrated Optimization
The MMArt Dataset
Experiments
Evaluation Metrics
Geometric Reconstruction
...and 11 more sections

Figures (5)

Figure 1: Our proposed SM$^3$ enables textured 3D reconstruction and articulation structure estimation solely from multi-view images captured before and after object interaction.
Figure 2: Architecture Overview of proposed SM$^3$.
Figure 3: Algorithmic workflow for Movable Part Segmentation Prior and Candidate Joints Prediction.
Figure 4: Visualization of 3D reconstruction and articulation. Movable parts are shown as green point clouds for clarity.
Figure 5: Visualization results for real-world articulated objects.
