D$^3$-Human: Dynamic Disentangled Digital Human from Monocular Video

Honghu Chen; Bo Peng; Yunfan Tao; Juyong Zhang

TL;DR

D$^3$-Human reconstructs decoupled clothing and the underlying body from monocular video by combining explicit and implicit representations. It introduces the human manifold signed distance field (hmSDF) to segment visible clothing and visible body on the clothed surface, relies on SMPL-based completion for invisible body regions, and uses two non-rigid deformation fields to model clothing and body separately. A region-aggregation step repairs segmentation holes caused by parsing noise, and occlusion-aware differentiable rendering keeps the 2D supervision consistent for both layers. The method generates templates quickly, produces high-fidelity decoupled geometry, and supports applications such as clothing transfer and physics-based animation from a single camera.

Abstract

We introduce D$^3$-Human, a method for reconstructing Dynamic Disentangled Digital Human geometry from monocular videos. Prior work on human reconstruction from monocular video focuses primarily on reconstructing clothed human bodies as a single undecoupled surface, or on reconstructing clothing only, which makes the results difficult to use directly in applications such as animation production. The challenge in reconstructing decoupled clothing and body lies in the occlusion of the body by the clothing. The reconstruction process must therefore preserve the details of the visible area and keep the invisible area plausible. Our proposed method combines explicit and implicit representations to model the decoupled clothed human body, leveraging the robustness of explicit representations and the flexibility of implicit representations. Specifically, we reconstruct the visible region as an SDF and propose a novel human manifold signed distance field (hmSDF) to segment the visible clothing and visible body, and then merge the visible and invisible body. Extensive experimental results demonstrate that, compared with existing reconstruction schemes, D$^3$-Human can achieve high-quality decoupled reconstruction of the human body wearing different clothing, and can be directly applied to clothing transfer and animation.
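To make the segmentation idea concrete, the following is a minimal sketch of splitting a clothed-surface mesh into clothing and body faces by the sign of a scalar field defined on the surface. It is not the authors' implementation: the function name `segment_by_hmsdf`, the per-vertex storage of the field, the per-face averaging, and the sign convention (negative means clothing) are all assumptions made for illustration. A real pipeline would more likely cut the mesh along the zero level set rather than assign whole faces.

```python
import numpy as np

def segment_by_hmsdf(faces, vertex_hmsdf):
    """Split triangle faces into (clothing, body) by the sign of a surface scalar.

    faces        : (F, 3) integer array of vertex indices.
    vertex_hmsdf : (V,) float array, one hypothetical hmSDF value per vertex.
    Assumption: negative values mark clothing, non-negative values mark body.
    """
    faces = np.asarray(faces, dtype=np.int64)
    values = np.asarray(vertex_hmsdf, dtype=np.float64)
    # Average the three vertex values of each face to get one value per face.
    face_mean = values[faces].mean(axis=1)
    cloth_mask = face_mean < 0.0
    return faces[cloth_mask], faces[~cloth_mask]

# Toy strip of three triangles over five vertices.
faces = np.array([[0, 1, 2], [1, 2, 3], [2, 3, 4]])
hmsdf = np.array([-1.0, -0.5, -0.2, 0.6, 0.9])
cloth_faces, body_faces = segment_by_hmsdf(faces, hmsdf)
print(len(cloth_faces), len(body_faces))
```

In this toy strip the first two faces have a negative mean and land in the clothing set, while the last face lands in the body set. Faces that straddle the boundary are assigned wholesale here, which is the main simplification of the sketch.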

Paper Structure (17 sections, 12 equations, 9 figures, 1 table, 1 algorithm)

Introduction
Related Work
Method
HmSDF Representation
Region Aggregation
Deformation Fields
Occlusion-Aware Differentiable Rendering
Training
Training Strategy
Reconstruction Loss
Regularization Term
Experiments
Quantitative Evaluation
Qualitative Evaluation
Ablation Study
...and 2 more sections

Figures (9)

Figure 1: D$^3$-Human can (b) reconstruct disentangled clothing and body from (a) input video, enabling (c) animation and (d) clothing transfer after reconstruction. Project page: https://ustc3dv.github.io/D3Human/.
Figure 2: Overview of D$^3$-Human. The optimization process is divided into two steps: template generation and detailed deformation. The object is initialized as a DMTet [shen2021dmtet] representation and is optimized to form a complete clothed human. An optimizable hmSDF function separates the clothing and body regions, with missing parts filled by SMPL. After generating the disentangled template, we use two MLPs to model detailed deformations for each frame of the body and clothing meshes separately. Finally, the meshes are transformed to the observed space using a forward LBS deformation, supervised by images, normal maps, and parsing masks with a differentiable renderer.
Figure 3: Schematic of region aggregation. For the correct segmentation results, $S_b$ and $S_c$ correctly segment the body and the cloth. For inaccurate segmentation results, $S_c''$ should merge with $S_b'$, and $S_b''$ should merge with $S_c'$.
Figure 4: Occlusion display of the mask. From left to right: the color image of the captured clothed human, the complete clothed body mask obtained from SAM2 [ravi2024sam2], the clothing mask obtained from SAM2, the mask obtained by rendering only the clothing mesh, and the mask of the effective clothing area after rendering the complete clothed body mesh.
Figure 5: Qualitative comparison of the proposed method with REC-MV [qiu2023RECMV], BCNet [jiang2020bcnet], DELTA [Feng2023DELTA], SelfRecon [jiang2022selfrecon], and GoMAvatar [wen2024gomavatar]. We use purple to visualize clothing that can be decoupled from the body. For REC-MV and BCNet, SMPL [loper2015smpl] was added as the body to show the complete reconstruction of the clothed human.
...and 4 more figures
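
The region-aggregation idea in the Figure 3 caption, where small mislabeled patches are merged into the opposite region, can be sketched as a connected-component pass over per-vertex labels. This is only one plausible reading of the caption, not the paper's algorithm: the function name `aggregate_regions`, the 0/1 label encoding (0 for body, 1 for clothing), and the rule "keep the largest component of each label and flip the rest" are assumptions made for illustration.

```python
from collections import deque

def aggregate_regions(num_vertices, edges, labels):
    """Flip every non-largest connected component of each label to the other label.

    num_vertices : number of mesh vertices.
    edges        : iterable of (a, b) vertex-index pairs (undirected mesh edges).
    labels       : list of 0/1 labels per vertex (assumed: 0 = body, 1 = clothing).
    """
    adjacency = [[] for _ in range(num_vertices)]
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)

    seen = [False] * num_vertices
    components = {0: [], 1: []}
    for start in range(num_vertices):
        if seen[start]:
            continue
        label = labels[start]
        component = []
        queue = deque([start])
        seen[start] = True
        # Breadth-first search restricted to vertices sharing the same label.
        while queue:
            v = queue.popleft()
            component.append(v)
            for w in adjacency[v]:
                if not seen[w] and labels[w] == label:
                    seen[w] = True
                    queue.append(w)
        components[label].append(component)

    merged = list(labels)
    for label, comps in components.items():
        if not comps:
            continue
        largest = max(comps, key=len)
        for comp in comps:
            if comp is not largest:
                # Treat a stray patch as parsing noise and merge it into the other region.
                for v in comp:
                    merged[v] = 1 - label
    return merged

# Toy chain of seven vertices with one stray clothing vertex at the end.
chain_edges = [(i, i + 1) for i in range(6)]
noisy_labels = [1, 1, 1, 0, 0, 0, 1]
clean_labels = aggregate_regions(7, chain_edges, noisy_labels)
print(clean_labels)
```

On the toy chain, the isolated clothing vertex at the end is absorbed into the surrounding body region. Garments made of several genuinely separate pieces would need a size threshold instead of the "largest component only" rule used here.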
