Learning Implicit Representation for Reconstructing Articulated Objects

Hao Zhang, Fang Li, Samyak Rawlekar, Narendra Ahuja

TL;DR

The paper tackles 3D reconstruction of moving articulated objects from monocular RGB videos without external 3D supervision or category-specific skeletons. It introduces LIMR, which jointly learns an explicit representation (surface geometry, color, and camera parameters) and an implicit skeleton with skinning weights, rigidity coefficients, and time-varying transforms. Both are optimized via the SIOS$^2$ algorithm, which alternates skeleton and surface updates. Key contributions include learning an implicit, near-physical skeleton from RGB videos, category-agnostic operation, and synergistic optimization of both representations, yielding improvements over state-of-the-art methods across multiple datasets. The work advances generalizable 3D dynamic reconstruction by leveraging motion cues to infer structure, enabling more accurate articulation modeling in the wild with minimal supervision.

Abstract

3D reconstruction of moving articulated objects without additional information about object structure is a challenging problem. Current methods overcome such challenges by employing category-specific skeletal models. Consequently, they do not generalize well to articulated objects in the wild. We treat an articulated object as an unknown, semi-rigid skeletal structure surrounded by nonrigid material (e.g., skin). Our method simultaneously estimates the visible (explicit) representation (3D shapes, colors, camera parameters) and the implicit skeletal representation, from motion cues in the object video without 3D supervision. Our implicit representation consists of four parts. (1) Skeleton, which specifies how semi-rigid parts are connected. (2) Skinning Weights, which associate each surface vertex with semi-rigid parts probabilistically. (3) Rigidity Coefficients, which specify the articulation of the local surface. (4) Time-Varying Transformations, which specify the skeletal motion and surface deformation parameters. We introduce an algorithm that uses physical constraints as regularization terms and iteratively estimates both implicit and explicit representations. Our method is category-agnostic, thus eliminating the need for category-specific skeletons. We show that our method outperforms state-of-the-art methods across standard video datasets.
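
To make the roles of skinning weights and time-varying transformations concrete, here is a minimal sketch of standard linear blend skinning in 2D. It is not the paper's implementation: the distance-based softmax used to produce the weights, the `temperature` parameter, and all function names are assumptions chosen for illustration.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def skinning_weights(vertex, bone_centers, temperature=1.0):
    """Soft assignment of one vertex to each part.

    Assumption: closer bone centers get higher probability. The paper
    learns its weights; this rule only illustrates their shape.
    """
    scores = [
        -((vertex[0] - c[0]) ** 2 + (vertex[1] - c[1]) ** 2) / temperature
        for c in bone_centers
    ]
    return softmax(scores)

def rigid_2d(angle, tx, ty):
    """Return a function applying a 2D rotation about the origin, then a translation."""
    c, s = math.cos(angle), math.sin(angle)
    def apply(p):
        return (c * p[0] - s * p[1] + tx, s * p[0] + c * p[1] + ty)
    return apply

def blend_skinning(vertex, weights, transforms):
    """Linear blend skinning: weighted average of the vertex moved by each part."""
    moved = [t(vertex) for t in transforms]
    x = sum(w * m[0] for w, m in zip(weights, moved))
    y = sum(w * m[1] for w, m in zip(weights, moved))
    return (x, y)

# Two parts; the second one translates upward by 1 at this time step.
bone_centers = [(0.0, 0.0), (2.0, 0.0)]
transforms = [rigid_2d(0.0, 0.0, 0.0), rigid_2d(0.0, 0.0, 1.0)]

for v in [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]:
    w = skinning_weights(v, bone_centers)
    print(v, [round(x, 3) for x in w], blend_skinning(v, w, transforms))
```

A vertex midway between the two parts receives equal weights and therefore moves by half of the second part's translation, which is the sense in which the weights associate a vertex with parts "with probability".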

Paper Structure

This paper contains 22 sections, 16 equations, 13 figures, 4 tables, and 1 algorithm.

Figures (13)

  • Figure 1: Method Overview. LIMR iteratively optimizes both the explicit representation $\mathcal{R}_e$ (e.g., surface mesh, color $\mathbf{M}$, and camera parameters $\mathbf{P}_C$) and the implicit representation $\mathcal{R}_i$ (e.g., skeleton $\mathbf{S_T}$, skinning weights $\mathbf{W}$, rigidity coefficients $\mathbf{R}$ derived from $\mathbf{W}$, and time-varying transformations $\mathbf{T}_t$ for the root body and semi-rigid parts at time $t$). We optimize $\mathcal{R}_e$ using differentiable rendering frameworks (Sec. 3.2, $\color{orange}{\gets}$) and optimize $\mathcal{R}_i$ using physical constraints (Sec. 3.1, $\color{green}{\gets}$). We optimize $\mathcal{R}_i$ and $\mathcal{R}_e$ jointly using the SIOS$^2$ algorithm (Sec. 3.3).
  • Figure 2: Optical Flow Warp. We back-project the 2D optical flow to camera space, obtaining the flow direction $\mathbf{F}^{S,t}$ for every vertex on the surface, and calculate the visibility matrix $\mathbf{V}$ according to the viewpoint. We then apply inverse blend skinning with $\mathbf{F}^{2\text{D},t}$, $\mathbf{V}$, and the skinning weights $\mathbf{W}$ as inputs to calculate the bone motion direction $\mathbf{F}^{B,t}$. Note that the superscript $t$ denotes the mapping from frame $t$ to frame $t+1$.
  • Figure 3: Mesh Results. We show the reconstruction results of (a) our approach, (b) our approach w/o part refinement, (c) LASR, and (d) BANMo on DAVIS's camel and dance-twirl and PlanetZoo's zebra.
  • Figure 4: Implicit Representation Results. (a) Variations with different $t_o$ values. (b) Results from different videos: from left to right, DAVIS's camel and dance-twirl, AMA's swing, and BANMo's human-cap. (c) The skeleton generated by RigNet. $B$ indicates the number of bones.
  • Figure 5: Rendering Results. We compare the rendering results on DAVIS's camel and dance-twirl with the prior art LASR.
  • ...and 8 more figures
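
The Figure 2 caption describes turning per-vertex flow into per-bone motion using visibility and skinning weights. The sketch below shows one simple way such an aggregation could work: a visibility- and weight-weighted average of vertex flow for each bone. The paper's actual inverse blend skinning formula is not given in this summary, so the averaging rule, the function name `bone_flow`, and the zero-flow fallback for unseen bones are all assumptions.

```python
def bone_flow(surface_flow, visibility, weights):
    """Aggregate per-vertex 3D flow into one flow vector per bone.

    surface_flow: list of (x, y, z) flow vectors, one per vertex
    visibility:   list of 0/1 flags, one per vertex
    weights:      weights[i][b] is the skinning weight of vertex i for bone b

    Assumed rule (not taken from the paper): each bone's flow is the
    average of visible vertex flows, weighted by the skinning weights.
    """
    num_bones = len(weights[0])
    flows = []
    for b in range(num_bones):
        mass = sum(v * w[b] for v, w in zip(visibility, weights))
        if mass == 0:
            # No visible vertex supports this bone; fall back to zero flow.
            flows.append((0.0, 0.0, 0.0))
            continue
        flows.append(tuple(
            sum(v * w[b] * f[k] for v, w, f in zip(visibility, weights, surface_flow)) / mass
            for k in range(3)
        ))
    return flows

# Three vertices, two bones; the middle vertex is occluded and is ignored.
surface_flow = [(1.0, 0.0, 0.0), (5.0, 5.0, 5.0), (0.0, 2.0, 0.0)]
visibility = [1, 0, 1]
weights = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

print(bone_flow(surface_flow, visibility, weights))
```

In this toy input the occluded vertex contributes nothing, so each bone inherits the flow of the single visible vertex bound to it.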