MoDA: Modeling Deformable 3D Objects from Casual Videos

Chaoyue Song; Jiacheng Wei; Tianyi Chen; Yiwen Chen; Chuan Sheng Foo; Fayao Liu; Guosheng Lin

TL;DR

MoDA addresses deformable 3D object reconstruction from casual videos by coupling a canonical neural radiance field with a neural dual quaternion blend skinning (NeuDBS) deformation model to guarantee rigid transformations and avoid skin-collapsing artifacts. It registers 2D pixels to 3D canonical points through an optimal transport formulation, promoting one-to-one correspondences between image features and canonical embeddings. A texture-filtered volume rendering pipeline further reduces background noise in texture synthesis. Experiments on real and synthetic data show that MoDA achieves superior qualitative and quantitative performance for humans and animals compared with state-of-the-art methods, highlighting the practical impact for casual-video-based 3D modeling and animation.
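As a rough illustration of how an optimal transport formulation can promote one-to-one 2D-3D correspondences, the sketch below runs a generic entropy-regularized Sinkhorn iteration between a handful of pixel features and canonical embeddings. This is not the paper's implementation: the cosine cost, the uniform marginals, the regularization strength `eps`, the function name `sinkhorn`, and the toy features are all assumptions made for the example.

```python
import numpy as np

def sinkhorn(cost, eps=0.05, n_iters=100):
    """Entropy-regularized optimal transport with uniform marginals (hypothetical helper)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # mass assigned to each pixel feature
    b = np.full(m, 1.0 / m)          # mass assigned to each canonical embedding
    K = np.exp(-cost / eps)          # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows toward marginal a
        v = b / (K.T @ u)            # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy data (assumed): 5 canonical embeddings, and pixel features that are
# shuffled, slightly perturbed copies of them.
canonical = normalize(np.eye(5, 8))
perm = np.array([2, 0, 4, 1, 3])
pixels = normalize(np.eye(5, 8)[perm] + 0.05)

cost = 1.0 - pixels @ canonical.T    # cosine distance between every pixel/embedding pair
plan = sinkhorn(cost)                # soft assignment whose rows and columns both carry mass 1/5
matches = plan.argmax(axis=1)        # hard correspondence read off the transport plan
```

Because both marginals are constrained, no single canonical point can absorb all pixels, which is what distinguishes the transport plan from an independent per-pixel softmax over similarities.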

Abstract

In this paper, we focus on the challenges of modeling deformable 3D objects from casual videos. With the popularity of neural radiance fields (NeRF), many works extend it to dynamic scenes with a canonical NeRF and a deformation model that transforms 3D points between the observation space and the canonical space. Recent works rely on linear blend skinning (LBS) to achieve the canonical-observation transformation. However, a linearly weighted combination of rigid transformation matrices is not guaranteed to be rigid; in fact, unexpected scale and shear factors often appear. In practice, using LBS as the deformation model often leads to skin-collapsing artifacts for bending or twisting motions. To solve this problem, we propose neural dual quaternion blend skinning (NeuDBS) for 3D point deformation, which performs rigid transformations without skin-collapsing artifacts. To register 2D pixels across different frames, we establish correspondences between canonical feature embeddings, which encode 3D points in the canonical space, and 2D image features by solving an optimal transport problem. In addition, we introduce a texture filtering approach for texture rendering that effectively reduces the impact of noisy colors outside the target deformable objects. Extensive experiments on real and synthetic datasets show that our approach can reconstruct 3D models of humans and animals with better qualitative and quantitative performance than state-of-the-art methods. Project page: https://chaoyuesong.github.io/MoDA.
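The non-rigidity of LBS can be checked numerically. The sketch below blends two hand-picked bone transformations with equal weights, once by averaging matrices (LBS) and once by averaging and renormalizing dual quaternions (classical dual quaternion blending). It is only an illustration under stated assumptions: in NeuDBS the dual quaternions and skinning weights are learned, whereas here the bone poses, the weights, and all function names are made up for the example.

```python
import numpy as np

def quat_from_axis_angle(axis, angle):
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    return np.concatenate([[np.cos(angle / 2)], np.sin(angle / 2) * axis])

def quat_mul(a, b):
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw*bw - ax*bx - ay*by - az*bz,
        aw*bx + ax*bw + ay*bz - az*by,
        aw*by - ax*bz + ay*bw + az*bx,
        aw*bz + ax*by - ay*bx + az*bw,
    ])

def quat_to_matrix(q):
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z), 2*(x*z + w*y)],
        [2*(x*y + w*z), 1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y), 2*(y*z + w*x), 1 - 2*(x*x + y*y)],
    ])

def lbs_blend(rotations, translations, weights):
    """Linear blend skinning: weighted average of the rigid transforms themselves."""
    R = sum(w * quat_to_matrix(q) for q, w in zip(rotations, weights))
    t = sum(w * np.asarray(t) for t, w in zip(translations, weights))
    return R, t

def dqb_blend(rotations, translations, weights):
    """Dual quaternion blending: average dual quaternions, then renormalize."""
    real, dual = np.zeros(4), np.zeros(4)
    for q, t, w in zip(rotations, translations, weights):
        d = 0.5 * quat_mul(np.concatenate([[0.0], t]), q)      # dual part encodes translation
        sign = 1.0 if np.dot(q, rotations[0]) >= 0 else -1.0   # keep quaternions in one hemisphere
        real += sign * w * q
        dual += sign * w * d
    norm = np.linalg.norm(real)
    real, dual = real / norm, dual / norm
    conj = real * np.array([1.0, -1.0, -1.0, -1.0])
    t = 2.0 * quat_mul(dual, conj)[1:]                         # recover the blended translation
    return quat_to_matrix(real), t

# Two bones (assumed values): rest pose, and a 150-degree twist about the x-axis.
q0, t0 = quat_from_axis_angle([1, 0, 0], 0.0), np.zeros(3)
q1, t1 = quat_from_axis_angle([1, 0, 0], np.deg2rad(150)), np.array([0.0, 0.1, 0.0])
weights = [0.5, 0.5]

R_lbs, _ = lbs_blend([q0, q1], [t0, t1], weights)
R_dqb, _ = dqb_blend([q0, q1], [t0, t1], weights)
print("det(LBS blend):", np.linalg.det(R_lbs))   # well below 1: local volume collapses
print("det(DQB blend):", np.linalg.det(R_dqb))   # 1 up to rounding: still a rotation
```

A determinant below 1 means the blended LBS matrix shrinks the region around a point that is equally influenced by both bones, which is the kind of skin collapse described above for twisting motions. The dual quaternion blend stays orthonormal by construction because the averaged rotation part is renormalized before being converted back to a matrix.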

Paper Structure

This paper contains 26 sections, 28 equations, 13 figures, 3 tables.

Introduction
Related work
3D human and animal models
3D reconstruction from images or videos
Neural radiance fields for dynamic scenes
Correspondence Learning
Revisit linear blend skinning
Method
Shape and appearance model
Deformation model
2D-3D matching via optimal transport
Volume rendering and optimization
Experiments
Dataset, metrics, and implementation details
Comparison results on multiple videos
...and 11 more sections

Figures (13)

Figure 1: In this work, we introduce MoDA, which reconstructs deformable 3D objects from input casual videos using neural deformation models. Deformation models transform 3D points between the canonical space (rest pose) and the observation space (deformed pose). The previous work BANMo [yang2022banmo] uses linear blend skinning as its deformation model, resulting in visible skin-collapsing artifacts on the arms. MoDA solves this problem with the proposed neural dual quaternion blend skinning.
Figure 2: From state a to c, BANMo and our method both perform well for motion with small joint rotations. From state d to f, BANMo shows increasingly obvious skin-collapsing artifacts for motion with large rotations, whereas our method resolves the artifacts with the proposed NeuDBS.
Figure 3: The overview of MoDA. We represent the deformable 3D objects from multiple casual videos with a shape and appearance model based on a canonical neural radiance field and a deformation model that achieves 3D point transformation between the observation space and the canonical space. Instead of linear blend skinning used in previous works, we propose NeuDBS as our deformation model. With the learned unit dual quaternions and the skinning weights, we can transform $\mathbf{X}^{t}$ from the observation space to $\mathbf{X}^{*}$ in the canonical space. We visualize the joints and the skinning weights (as surface colors) in the canonical space.
Figure 4: Qualitative comparison on multiple videos. From top to bottom, the data are casual-adult, casual-human, AMA-samba, casual-cat, and eagle. The lower right corner of each reference image shows the corresponding rest pose. We show 2 views of the reconstructed results based on the reference images. ViSER [yang2021viser] fails to learn detailed 3D shapes and accurate poses from the videos. BANMo [yang2022banmo] has obvious skin-collapsing artifacts (in the red circles) for motions with large joint rotations, while our method performs well. For eagle, which has only slight motion, BANMo and our method perform similarly.
Figure 5: Qualitative comparison on a single video. From top to bottom, the data are casual-adult, AMA-swing, casual-cat, and eagle. The lower right corner of each reference image shows the corresponding rest pose. We show 2 views of the reconstructed results based on the reference images. HyperNeRF [park2021hypernerf] fails to learn reasonable shapes and deformations. In the single-video setup, BANMo [yang2022banmo] still has obvious skin-collapsing artifacts (in the red circles) for motions with large joint rotations, while our method performs better.
...and 8 more figures
