Learning Implicit Representation for Reconstructing Articulated Objects
Hao Zhang, Fang Li, Samyak Rawlekar, Narendra Ahuja
TL;DR
The paper tackles 3D reconstruction of moving articulated objects from monocular RGB videos without external 3D supervision or category-specific skeletons. It introduces LIMR, which jointly learns an explicit representation (surface geometry, color, and camera parameters) and an implicit representation (a skeleton with skinning weights, rigidity coefficients, and time-varying transforms). Both are optimized by the $SIOS^2$ algorithm, which alternates skeleton and surface updates. Key contributions include learning an implicit, near-physical skeleton from RGB videos, category-agnostic operation, and synergistic optimization of both representations, yielding improvements over state-of-the-art methods across multiple datasets. The work advances generalizable 3D dynamic reconstruction by leveraging motion cues to infer structure, enabling more accurate articulation modeling in the wild with minimal supervision.
Abstract
3D reconstruction of moving articulated objects without additional information about object structure is a challenging problem. Current methods overcome this challenge by employing category-specific skeletal models. Consequently, they do not generalize well to articulated objects in the wild. We treat an articulated object as an unknown, semi-rigid skeletal structure surrounded by nonrigid material (e.g., skin). Our method simultaneously estimates the visible (explicit) representation (3D shapes, colors, camera parameters) and the implicit skeletal representation from motion cues in the object video, without 3D supervision. Our implicit representation consists of four parts. (1) Skeleton, which specifies how semi-rigid parts are connected. (2) Skinning Weights, which associate each surface vertex with the semi-rigid parts probabilistically. (3) Rigidity Coefficients, which specify the articulation of the local surface. (4) Time-Varying Transformations, which specify the skeletal motion and surface deformation parameters. We introduce an algorithm that uses physical constraints as regularization terms and iteratively estimates both implicit and explicit representations. Our method is category-agnostic, thus eliminating the need for category-specific skeletons. We show that our method outperforms state-of-the-art methods across standard video datasets.
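
To make the roles of skinning weights and time-varying transformations concrete, the following is a minimal sketch of how per-vertex probabilistic weights can blend the rigid motions of several parts into a surface deformation. It assumes a standard linear-blend-skinning formulation, which the abstract does not specify; the function names, the softmax parameterization of the weights, and the toy inputs are all hypothetical, and rigidity coefficients are not modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax, used to turn raw scores into probabilities."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def blend_skinning(vertices, logits, rotations, translations):
    """Deform vertices with a weighted mix of per-part rigid transforms.

    vertices:     (V, 3) rest-pose surface points
    logits:       (V, B) unnormalized vertex-to-part association scores (assumed)
    rotations:    (B, 3, 3) rotation of each part at one time step
    translations: (B, 3) translation of each part at the same time step
    Returns the deformed vertices (V, 3) and the skinning weights (V, B).
    """
    # Each row of `weights` is a probability distribution over the B parts.
    weights = softmax(logits, axis=1)
    # Apply every part's rigid transform to every vertex: shape (B, V, 3).
    moved = np.einsum("bij,vj->bvi", rotations, vertices) + translations[:, None, :]
    # Blend the B candidate positions of each vertex using its weights.
    deformed = np.einsum("vb,bvi->vi", weights, moved)
    return deformed, weights

# Toy example: two parts, the second one translated by one unit along x.
vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
logits = np.array([[50.0, -50.0], [-50.0, 50.0], [0.0, 0.0]])
rotations = np.stack([np.eye(3), np.eye(3)])
translations = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])

deformed, weights = blend_skinning(vertices, logits, rotations, translations)
print(deformed)
```

In this toy setup the first vertex follows the static part, the second follows the translated part, and the third, which is split evenly between the two, moves by half the translation. In a learned setting the scores and transforms would be optimized from video rather than set by hand.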
