REACTO: Reconstructing Articulated Objects from a Single Video

Chaoyue Song; Jiacheng Wei; Chuan-Sheng Foo; Guosheng Lin; Fayao Liu

TL;DR

REACTO tackles the challenge of reconstructing general articulated 3D objects from a single monocular video by combining a canonical NeRF-based shape/appearance model with a novel deformation scheme called Quasi-Rigid Blend Skinning (QRBS). QRBS rigidifies each object component by rigging on bones, enforces quasi-sparsity in skinning weights, and uses geodesic point assignment to prevent seam artifacts and preserve joint flexibility, enabling accurate motion and surface detail. The approach is validated on real and synthetic datasets, outperforming state-of-the-art methods in both qualitative reconstructions and quantitative metrics such as Chamfer Distance and F-scores, while ablations demonstrate the superiority of QRBS over displacement fields and invertible flows. This work advances single-video articulated-object reconstruction by enabling high-fidelity 3D geometry and appearance for everyday objects, with potential impact on robotics, animation, and AR/VR applications.
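For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how Chamfer Distance and F-score are commonly computed between a predicted and a ground-truth point set. The function names, the unsquared and averaged form of the Chamfer Distance, and the threshold parameter are assumptions made for illustration; conventions differ between papers, and this is not necessarily the exact evaluation protocol used for REACTO.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between every point in a (N,3) and b (M,3)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance (assumed form: mean of the two
    directed average nearest-neighbour distances, unsquared)."""
    d = pairwise_dist(pred, gt)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a distance threshold."""
    d = pairwise_dist(pred, gt)
    precision = (d.min(axis=1) < threshold).mean()  # pred points near gt
    recall = (d.min(axis=0) < threshold).mean()     # gt points near pred
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

# Toy example: the prediction is the ground truth shifted by 0.1 along z.
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
pred = gt + np.array([0.0, 0.0, 0.1])
print(chamfer_distance(pred, gt), f_score(pred, gt, threshold=0.2))
```

Lower Chamfer Distance and higher F-score both indicate a reconstruction closer to the reference geometry.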

Abstract

In this paper, we address the challenge of reconstructing general articulated 3D objects from a single video. Existing works employing dynamic neural radiance fields have advanced the modeling of articulated objects like humans and animals from videos, but face challenges with piece-wise rigid general articulated objects due to limitations in their deformation models. To tackle this, we propose Quasi-Rigid Blend Skinning, a novel deformation model that enhances the rigidity of each part while maintaining flexible deformation of the joints. Our primary insight combines three distinct approaches: 1) an enhanced bone rigging system for improved component modeling, 2) the use of quasi-sparse skinning weights to boost part rigidity and reconstruction fidelity, and 3) the application of geodesic point assignment for precise motion and seamless deformation. Our method outperforms previous works in producing higher-fidelity 3D reconstructions of general articulated objects, as demonstrated on both real and synthetic datasets. Project page: https://chaoyuesong.github.io/REACTO.
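To make the role of skinning-weight sparsity concrete, here is a small sketch using plain linear blend skinning, which is the standard baseline and not the paper's QRBS formulation. The helper names, the temperature-scaled softmax used to sharpen the weights, and the toy bone transforms are all assumptions for illustration. The sketch shows that evenly blended weights distort a point that should move rigidly, whereas near one-hot weights reproduce the rigid motion of a single bone.

```python
import numpy as np

def sharpened_weights(dist, temperature):
    """Skinning weights from point-to-bone distances (N,B) via a softmax.
    A small temperature pushes each row toward a one-hot assignment
    (an assumed stand-in for 'quasi-sparse' weights)."""
    logits = -dist / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def blend_skinning(points, weights, rotations, translations):
    """Linear blend skinning: weighted sum of per-bone rigid transforms.
    points (N,3), weights (N,B), rotations (B,3,3), translations (B,3)."""
    per_bone = np.einsum('bij,nj->nbi', rotations, points) + translations[None]
    return np.einsum('nb,nbi->ni', weights, per_bone)

# Two hypothetical bones: identity, and a 90-degree rotation about z.
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
rotations = np.stack([np.eye(3), rot_z])
translations = np.zeros((2, 3))
point = np.array([[1.0, 0.0, 0.0]])

# Evenly blended weights pull the point off the unit circle (non-rigid).
blended = blend_skinning(point, np.array([[0.5, 0.5]]), rotations, translations)
# One-hot weights reproduce the second bone's rigid rotation exactly.
rigid = blend_skinning(point, np.array([[0.0, 1.0]]), rotations, translations)
print(np.linalg.norm(blended), np.linalg.norm(rigid))
```

In this toy setting the blended point has norm below 1 while the rigidly assigned point keeps norm 1, which is the kind of shape distortion that sparser weights are meant to reduce on piece-wise rigid parts.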

Paper Structure (27 sections, 11 equations, 7 figures, 2 tables, 1 algorithm)

This paper contains 27 sections, 11 equations, 7 figures, 2 tables, 1 algorithm.

Introduction
Related Work
Modeling articulated objects.
Shape reconstruction from images or videos.
Neural representations for dynamic scenes.
Method
Canonical NeRF for shape and appearance
Quasi-rigid blend skinning for deformation
Motion representation.
Bone definition.
Skinning weights.
Geodesic point assignment.
Quasi-rigid blend skinning.
Volume rendering and optimization
Volume rendering.
...and 12 more sections

Figures (7)

Figure 1: Given a single casual video capturing a piece-wise rigid general articulated object, REACTO can model the 3D shape, texture, and motion. The second row presents shape reconstruction results from reference views, the third row showcases the reconstructed texture, and the fourth row displays the shapes from another view.
Figure 2: Rig on joints vs. rig on bones. A straightforward approach to controlling the motion of general articulated objects is to adopt methods used for modeling humans or animals [yang2023ppr], which typically define the rig based on joints. This design can lead to bending shapes and corrupted motion. In contrast, we propose a novel approach by defining the rig based on bones, enhancing the rigidity and motion integrity of each component.
Figure 3: The overview of REACTO. We model an articulated 3D object from a single video using a shape and appearance model based on a canonical Neural Radiance Field (NeRF) and a deformation model for transforming 3D points between the observation space and the canonical space. Instead of linear blend skinning or dual quaternion blend skinning designed for human or animal motion modeling, we propose Quasi-Rigid Blend Skinning (QRBS) as our deformation model, with the learned quasi-sparse skinning weights, to accurately transform $\mathbf{X}^{t}$ from the observation space to $\mathbf{X}^{*}$ in the canonical space. We visualize the 3 bones for glasses in the canonical space. The colors in skinning weights signify the assigned bone for each point.
Figure 4: Geodesic distances between 3D point and bones. Geodesic distance can correctly associate the 3D point (black) with the top bone (blue) rather than the bottom bone (yellow) by following the shortest path on the mesh surface. Shorter distances indicate stronger associations.
Figure 5: Qualitative comparison of our method with BANMo [yang2022banmo], MoDA [song2023moda], and PPR [yang2023ppr]. BANMo and MoDA struggle with complete shape reconstruction (real-faucet, real-scissors). Non-smooth surfaces (BANMo on real-stapler, MoDA on real-scissors, BANMo and MoDA on real-laptop) are also observed. The results of PPR are smoother but show surface tearing (real-stapler, real-scissors), over-smoothed joints (real-faucet, real-laptop, real-stapler), and inaccuracies in motion modeling (real-faucet, real-laptop). In contrast, REACTO outperforms these methods, excelling in the shape and deformation reconstruction of articulated objects. Please find the video results in the supplementary material.
...and 2 more figures
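
The idea in the Figure 4 caption, that the shortest path along the surface can associate a point with a different bone than straight-line distance would, can be sketched with Dijkstra's algorithm over mesh edges. This is an edge-graph approximation of geodesic distance chosen for brevity; the function names and the hairpin-shaped toy mesh are assumptions, and the paper may compute geodesic distances differently.

```python
import heapq
import numpy as np

def build_edge_graph(vertices, faces):
    """Adjacency map {vertex: {neighbour: edge_length}} from triangles."""
    graph = {i: {} for i in range(len(vertices))}
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            w = float(np.linalg.norm(vertices[u] - vertices[v]))
            graph[u][v] = w
            graph[v][u] = w
    return graph

def geodesic_distances(graph, source):
    """Dijkstra shortest-path lengths along mesh edges from one vertex."""
    dist = {v: float('inf') for v in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical hairpin-shaped strip: two arms only 0.2 apart in space,
# but connected only through the bend at x = 2.
path = [(0, 0), (1, 0), (2, 0), (2, 0.2), (1, 0.2), (0, 0.2)]
vertices = np.array([(x, y, z) for x, y in path for z in (0.0, 1.0)])
faces = []
for k in range(len(path) - 1):
    faces.append((2 * k, 2 * k + 2, 2 * k + 1))
    faces.append((2 * k + 1, 2 * k + 2, 2 * k + 3))

graph = build_edge_graph(vertices, faces)
geo = geodesic_distances(graph, source=0)
euclid = float(np.linalg.norm(vertices[10] - vertices[0]))
print(euclid, geo[10])
```

Here the two arm ends are close in Euclidean terms but far apart along the surface, so a geodesic criterion would not mistake a point on one arm for a point belonging to the other.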
