ReLoo: Reconstructing Humans Dressed in Loose Garments from Monocular Video in the Wild

Chen Guo, Tianjian Jiang, Manuel Kaufmann, Chengwei Zheng, Julien Valentin, Jie Song, Otmar Hilliges

TL;DR

This work establishes a layered neural human representation that decomposes clothed humans into a neural inner body and outer clothing and introduces a non-hierarchical virtual bone deformation module for the clothing layer that allows the accurate recovery of non-rigidly deforming loose clothing.

Abstract

While previous years have seen great progress in the 3D reconstruction of humans from monocular videos, few of the state-of-the-art methods are able to handle loose garments that exhibit large non-rigid surface deformations during articulation. This limits the application of such methods to humans that are dressed in standard pants or T-shirts. Our method, ReLoo, overcomes this limitation and reconstructs high-quality 3D models of humans dressed in loose garments from monocular in-the-wild videos. To tackle this problem, we first establish a layered neural human representation that decomposes clothed humans into a neural inner body and outer clothing. On top of the layered neural representation, we further introduce a non-hierarchical virtual bone deformation module for the clothing layer, whose bones can move freely, which allows the accurate recovery of non-rigidly deforming loose clothing. A global optimization jointly optimizes the shape, appearance, and deformations of the human body and clothing via multi-layer differentiable volume rendering. To evaluate ReLoo, we record subjects with dynamically deforming garments in a multi-view capture studio. This evaluation, on both existing datasets and our novel dataset, demonstrates ReLoo's clear superiority over prior art on both indoor datasets and in-the-wild videos.
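
The volume rendering step named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a Laplace-CDF mapping from signed distance to density (in the style of VolSDF), and it assumes that samples from the body and garment layers are simply merged by depth before standard alpha compositing. The names `sdf_to_density`, `composite`, `beta`, and `far` are hypothetical.

```python
import math

def sdf_to_density(sdf, beta=0.01):
    # Assumed mapping: density = LaplaceCDF(-sdf) / beta.
    # It is close to 1/beta inside the surface (sdf < 0) and near zero outside.
    s = -sdf
    if s <= 0.0:
        cdf = 0.5 * math.exp(s / beta)
    else:
        cdf = 1.0 - 0.5 * math.exp(-s / beta)
    return cdf / beta

def composite(samples, far, beta=0.01):
    """Alpha-composite (depth, sdf, rgb) samples along one ray.

    Samples from all layers are sorted by depth (an assumption of this
    sketch). Returns the accumulated color and the ray opacity.
    """
    samples = sorted(samples, key=lambda smp: smp[0])
    transmittance = 1.0
    rgb = [0.0, 0.0, 0.0]
    for i, (depth, sdf, color) in enumerate(samples):
        # Interval length to the next sample, or to the far bound for the last one.
        next_depth = samples[i + 1][0] if i + 1 < len(samples) else far
        delta = next_depth - depth
        alpha = 1.0 - math.exp(-sdf_to_density(sdf, beta) * delta)
        weight = transmittance * alpha
        for k in range(3):
            rgb[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return rgb, 1.0 - transmittance

# Toy ray: it hits a red garment surface first, then a blue body surface behind it.
RED, BLUE = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)
garment = [(1.0, 0.1, RED), (1.1, 0.0, RED), (1.2, -0.1, RED)]
body = [(1.3, 0.1, BLUE), (1.4, 0.0, BLUE), (1.5, -0.1, BLUE)]
rgb, opacity = composite(garment + body, far=1.6)
```

In this toy setup the garment occludes the body, so the composited color is dominated by the garment samples. Every operation is differentiable in the SDF and color values, which is what allows image losses to drive shape and appearance in a setup of this kind.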

Paper Structure

This paper contains 26 sections, 13 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: Method Overview. Given an image from a video sequence, we sample points along the camera ray for each neural layer. We warp sampled points for the body layer $\boldsymbol{x}_d^B$ into canonical space via inverse LBS derived from skeletal deformations. We deform sampled points for the garment layer $\boldsymbol{x}_d^G$ into canonical space via inverse warping based on the proposed virtual bone deformation module (Sec. \ref{sec:deformation}). We then evaluate the respective implicit network to obtain the SDF and radiance values (Sec. \ref{sec:doublelayer}). We apply multi-layer differentiable volume rendering to learn the shape, appearance, and deformations of the layered neural human representation from images (Sec. \ref{sec:rendering}). The loss function $\mathcal{L}$ compares the rendered color predictions with image observations as well as a segmentation mask obtained using SAM \cite{sam_hq} (Sec. \ref{sec:optimization}).
  • Figure 2: Qualitative 3D surface reconstruction comparison. Baseline methods produce less detailed and implausible 3D clothed human reconstructions with visible artifacts (discontinuities between legs, missing dress parts) due to the strong reliance on skeletal deformations. In contrast, our method correctly recovers the clothing dynamics and generates more detailed and complete 3D human surfaces. Note also that ReLoo produces more detailed facial features.
  • Figure 3: Qualitative novel view synthesis comparison. Our method achieves better rendering quality with detailed texture recovery in, e.g., garment patterns and faces. Baseline methods produce corrupted and blurry renderings (dress discontinuities between the legs and unsharp texture details).
  • Figure 4: Qualitative comparisons with a template-based method. Compared to the template-based method, our representation and learning schemes enable more detailed and realistic human surface reconstruction and topological flexibility.
  • Figure 5: Importance of multi-round sampling. A one-round sampling strategy can lead to physically implausible clothed human reconstructions with severe garment-body interpenetration, while multi-round sampling achieves better holistic reconstructions.
  • ...and 3 more figures
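
The inverse warping described in the Figure 1 caption can be sketched as inverse blend skinning over a set of independent bones. This is a minimal illustration under stated assumptions, not the paper's deformation module: each bone is a rigid transform `(R, t)` with no parent-child hierarchy, and the blend weights for the deformed sample are taken as given (in a learned setting they would come from a network). The function names `blend`, `forward_warp`, and `inverse_warp` are hypothetical.

```python
def blend(weights, bones):
    """Blend per-bone rigid transforms (R, t) into one affine map (A, b)."""
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for w, (R, t) in zip(weights, bones):
        for r in range(3):
            b[r] += w * t[r]
            for c in range(3):
                A[r][c] += w * R[r][c]
    return A, b

def det3(M):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def forward_warp(x_c, weights, bones):
    """Canonical point -> deformed point: x_d = A x_c + b."""
    A, b = blend(weights, bones)
    return [sum(A[r][c] * x_c[c] for c in range(3)) + b[r] for r in range(3)]

def inverse_warp(x_d, weights, bones):
    """Deformed point -> canonical point by solving A x_c = x_d - b (Cramer's rule)."""
    A, b = blend(weights, bones)
    rhs = [x_d[r] - b[r] for r in range(3)]
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("blended transform is singular")
    x_c = []
    for c in range(3):
        M = [row[:] for row in A]
        for r in range(3):
            M[r][c] = rhs[r]  # replace column c with the right-hand side
        x_c.append(det3(M) / d)
    return x_c

# Two free-moving bones: identity, and a 90-degree rotation about z plus a shift.
identity = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0])
rot_z90 = ([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [1.0, 0.0, 0.0])
bones, weights = [identity, rot_z90], [0.7, 0.3]

x_canonical = [0.2, -0.5, 1.0]
x_deformed = forward_warp(x_canonical, weights, bones)
x_recovered = inverse_warp(x_deformed, weights, bones)
```

Because the bones are not chained, each transform can be set independently, which is the property the caption attributes to the garment layer. The skeletal inverse LBS for the body layer has the same algebraic form, except that its bone transforms are composed along a kinematic tree.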