MOSS: Motion-based 3D Clothed Human Synthesis from Monocular Video

Hongsheng Wang; Xiang Cai; Xi Sun; Jinhong Yue; Zhanyun Tang; Shengyu Zhang; Feng Lin; Fei Wu

TL;DR

Experiments show that MOSS achieves state-of-the-art visual quality in 3D clothed human synthesis from monocular videos, improving on HumanNeRF and Gaussian Splatting by 33.94% and 16.75% in LPIPS*, respectively.

Abstract

Single-view clothed human reconstruction is central to virtual reality applications, especially those involving intricate human motions, and achieving realistic clothing deformation remains a notable challenge. Current methods often overlook the influence of motion on surface deformation, so the reconstructed surfaces lack the constraints imposed by global motion. To overcome these limitations, we introduce Motion-based 3D Clothed Human Synthesis (MOSS), a framework that uses kinematic information to achieve motion-aware Gaussian split on the human surface. The framework consists of two modules: Kinematic Gaussian Locating Splatting (KGAS) and Surface Deformation Detector (UID). KGAS uses the matrix-Fisher distribution to propagate global motion across the body surface. The density and rotation factors of this distribution explicitly control the Gaussians, which enhances the realism of the reconstructed surface. To address local occlusions in single-view settings, UID builds on KGAS to identify surfaces with significant deformation, and geometric reconstruction is then performed to compensate for these deformations. Experimental results demonstrate that MOSS achieves state-of-the-art visual quality in 3D clothed human synthesis from monocular videos. Notably, MOSS improves on HumanNeRF and Gaussian Splatting by 33.94% and 16.75% in LPIPS*, respectively. Code is available at https://wanghongsheng01.github.io/MOSS/.

Paper Structure (54 sections, 42 equations, 9 figures, 6 tables)

Introduction
Related Work
3D Reconstruction and Rendering
Human Reconstruction
Surface Reconstruction
Method
Overview
Preliminary
SMPL and LBS transformation
Matrix-Fisher distribution
3D Gaussian Splatting
Kinematic Gaussian Locating Splatting (KGAS)
Global motion extraction
Motion-Based 3D Gaussian Splatting
Surface Deformation Detector (UID)
...and 39 more sections

Figures (9)

Figure 1: MOSS reconstructs 3D clothed humans with detailed joints and fine clothing folds. The right image demonstrates that MOSS surpasses the visual quality of previous works on MonoCap (LPIPS* = LPIPS $\times 10^{3}$). Larger circles denote higher FPS.
Figure 2: MOSS framework. MOSS conditions the Fisher distribution of child joints on the Fisher distribution of their parent joints within the kinematic hierarchy tree, thereby linking the rotation matrices of each joint to the global motion through Joint-Driven Orientation Refinement. The UID is employed to detect and locate areas with numerous surface folds on the human body. In these regions, the Gaussians are scaled by the axial matrix from the SVD of the Fisher parameter and rotated by the directional matrix predicted from the Fisher distribution using KGAS. The T-pose is then converted to the target pose, and the surface folds are refined accordingly.
Figure 3: Fisher's Gaussian Sampling. A 2D example of how spindle concentration affects Gaussian sampling. Note that the color bar represents the probability of sampling, with darker colors representing higher probabilities.
Figure 4: Solving occlusion problems with UID (2D). Depending on the viewing angle, smaller folds can be occluded by more prominent folds. By calculating the degree of directional change in the local distribution of Gaussians, the regions with large deformation on the surface are localized and densely processed.
Figure 5: To ensure a fair comparison, we compare against NeuralBody, HumanNeRF, AnimateNeRF, InstantNVR, GauHuman, and Human101 at 512 $\times$ 512 resolution. Our model shows better visual quality and more detail.
...and 4 more figures
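
The Figure 2 caption mentions scaling and rotating Gaussians using the SVD of the Fisher parameter. As a rough illustration only, the sketch below computes the mode rotation and the per-axis concentrations of a matrix-Fisher distribution $p(R) \propto \exp(\mathrm{tr}(F^{\top} R))$ from the proper SVD of a 3×3 parameter $F$, which is a standard property of that distribution. The function names and the mapping from concentration to Gaussian scale (`hypothetical_scales`) are assumptions made for this example and are not taken from the paper or its code.

```python
import numpy as np

def proper_svd(F):
    """Proper SVD F = U @ diag(s) @ Vt with det(U) = det(Vt) = +1.

    The last singular value may become negative so that both
    orthogonal factors are true rotations.
    """
    U, s, Vt = np.linalg.svd(F)
    du = np.sign(np.linalg.det(U))
    dv = np.sign(np.linalg.det(Vt))
    U, s, Vt = U.copy(), s.copy(), Vt.copy()
    U[:, -1] *= du      # flip last left singular vector if U is a reflection
    Vt[-1, :] *= dv     # flip last right singular vector if Vt is a reflection
    s[-1] *= du * dv    # compensate so the product still equals F
    return U, s, Vt

def fisher_mode_and_concentration(F):
    """Mode rotation and per-axis concentrations of a matrix-Fisher density."""
    U, s, Vt = proper_svd(F)
    mode = U @ Vt  # rotation that maximizes tr(F^T R)
    # Concentration of rotations about principal axis i is s_j + s_k.
    conc = np.array([s[1] + s[2], s[0] + s[2], s[0] + s[1]])
    return mode, conc

def hypothetical_scales(conc, base_scale=1.0):
    """Assumed mapping (not from the paper): tighter concentration, smaller scale."""
    return base_scale / np.sqrt(1.0 + conc)

if __name__ == "__main__":
    theta = np.deg2rad(30.0)
    c, s_ = np.cos(theta), np.sin(theta)
    R0 = np.array([[c, -s_, 0.0], [s_, c, 0.0], [0.0, 0.0, 1.0]])
    mode, conc = fisher_mode_and_concentration(10.0 * R0)
    print(np.round(mode, 3))
    print(conc, hypothetical_scales(conc))
```

With $F = 10 R_0$, the mode recovers $R_0$ and every axis has the same concentration, so the assumed scale mapping yields an isotropic Gaussian. An anisotropic $F$ would give different concentrations per axis and therefore different scales under this mapping.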
