S3O: A Dual-Phase Approach for Reconstructing Dynamic Shape and Skeleton of Articulated Objects from Single Monocular Video

Hao Zhang, Fang Li, Samyak Rawlekar, Narendra Ahuja

TL;DR

S3O reconstructs dynamic articulated objects from a single monocular video by jointly learning the visible shape and an underlying skeleton, without pre-defined templates or camera poses. It uses a dual-phase, coarse-to-fine optimization with an EM-like loop that alternates between updating the shape and skeleton and refining time-varying motion and mesh details, guided by physical constraints and Dynamic Rigidity. A coarse initialization in the canonical frame bootstraps the process, after which time-varying motion and refined skeletons are learned from motion cues such as 2D optical flow. Results on standard benchmarks and the PlanetZoo dataset show improved shape and skeleton reconstruction and approximately 60% less training time than state-of-the-art methods.

Abstract

Reconstructing dynamic articulated objects from a single monocular video is challenging, requiring joint estimation of shape, motion, and camera parameters from limited views. Current methods typically demand extensive computational resources and training time, and require additional human annotations such as predefined parametric models, camera poses, and keypoints, limiting their generalizability. We propose Synergistic Shape and Skeleton Optimization (S3O), a novel two-phase method that forgoes these prerequisites and efficiently learns parametric models including visible shapes and underlying skeletons. Conventional strategies typically learn all parameters simultaneously, leading to interdependencies where a single incorrect prediction can result in significant errors. In contrast, S3O adopts a phased approach: it first focuses on learning coarse parametric models, then progresses to motion learning and detail addition. This method substantially lowers computational complexity and enhances robustness in reconstruction from limited viewpoints, all without requiring additional annotations. To address the current inadequacies in benchmarks for 3D reconstruction from monocular video, we collected the PlanetZoo dataset. Our experimental evaluations on standard benchmarks and the PlanetZoo dataset affirm that S3O provides more accurate 3D reconstructions and plausible skeletons, and reduces training time by approximately 60% compared to state-of-the-art methods.

Paper Structure (22 sections, 14 equations, 16 figures, 4 tables, 1 algorithm)

Figures (16)

  • Figure 1: Overview of Skeleton Learning in S3O. We start by deriving a 3D skeleton from 2D inputs and learning a coarse shape; the coarse shape is shown in cyan and the current shape in gray. The next step refines this shape by adjusting the skeleton, shifting bones, and expanding the structure. We then upsample the skeleton points for greater detail before applying final physical constraints to keep the result plausible.
  • Figure 2: Mesh Results. We show the reconstruction results of (a) LASR, (b) BANMo, and (c) Ours on the camel and cows sequences from DAVIS and the zebra, elephant, tiger, and giraffe sequences from PlanetZoo. Top views of the reconstruction results are shown in Fig. \ref{fig_moreview}.
  • Figure 3: Skeleton and Skinning Weight Results. (a), (b), and (c) show the skeleton and skinning weights from S3O with different bone motion similarity thresholds ($0.99, 0.95, 0.90$). (d) shows the bone distribution and skinning weights from LASR. Since the skinning weights assign each surface vertex to multiple bones (each given a different color), we show each vertex in the color of the bone with the maximum skinning weight for that vertex.
  • Figure 4: Skeleton Comparison. We compare the skeletons learned by our approach for the tiger, elephant, and camel with those from RigNet \cite{rignet} and Skeletor \cite{skeletor}.
  • Figure 5: Rendering Results of S3O and LASR on the camel video.
  • ...and 11 more figures
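
The coloring rule described in the Figure 3 caption (each vertex takes the color of the bone with the largest skinning weight) can be illustrated with a minimal sketch. This is not the authors' visualization code: the function name `dominant_bone_colors`, the list-of-rows weight layout, and the RGB palette are assumptions made only for this example.

```python
def dominant_bone_colors(skin_weights, bone_colors):
    """Pick, for every vertex, the bone with the maximum skinning weight.

    skin_weights: one row per vertex, one weight per bone (hypothetical layout).
    bone_colors:  one color per bone, e.g. an RGB tuple (hypothetical palette).
    Returns (dominant bone index per vertex, display color per vertex).
    """
    dominant = []
    for row in skin_weights:
        if len(row) != len(bone_colors):
            raise ValueError("each vertex needs exactly one weight per bone")
        # Argmax over bones; ties resolve to the lowest bone index.
        dominant.append(max(range(len(row)), key=row.__getitem__))
    colors = [bone_colors[b] for b in dominant]
    return dominant, colors


# Three vertices, three bones (illustrative numbers, not results from the paper).
weights = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
]
palette = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
bone_ids, vertex_colors = dominant_bone_colors(weights, palette)
print(bone_ids)  # [0, 2, 1]
```

A hard argmax discards the soft blending between bones, so a vertex near a joint with nearly equal weights still gets a single color; that is a property of this display choice, not of the underlying skinning weights.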