DressRecon: Freeform 4D Human Reconstruction from Monocular Video

Jeff Tan, Donglai Xiang, Shubham Tulsiani, Deva Ramanan, Gengshan Yang

TL;DR

DressRecon reconstructs time-consistent human body models from monocular videos, focusing on extremely loose clothing or handheld object interactions, and yields higher-fidelity 3D reconstructions than prior art.

Abstract

We present a method to reconstruct time-consistent human body models from monocular videos, focusing on extremely loose clothing or handheld object interactions. Prior work in human reconstruction is either limited to tight clothing with no object interactions, or requires calibrated multi-view captures or personalized template scans which are costly to collect at scale. Our key insight for high-quality yet flexible reconstruction is the careful combination of generic human priors about articulated body shape (learned from large-scale training data) with video-specific articulated "bag-of-bones" deformation (fit to a single video via test-time optimization). We accomplish this by learning a neural implicit model that disentangles body versus clothing deformations as separate motion model layers. To capture subtle geometry of clothing, we leverage image-based priors such as human body pose, surface normals, and optical flow during optimization. The resulting neural fields can be extracted into time-consistent meshes, or further optimized as explicit 3D Gaussians for high-fidelity interactive rendering. On datasets with highly challenging clothing deformations and object interactions, DressRecon yields higher-fidelity 3D reconstructions than prior art. Project page: https://jefftan969.github.io/dressrecon/
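
The "bag-of-bones" idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes each bone is an isotropic Gaussian (a center and a scale), that skinning weights are a softmax over negative scaled squared distances, and that each bone carries only a translation instead of a full rigid transform. The names `skinning_weights` and `blend_translations` are hypothetical.

```python
import math


def skinning_weights(point, centers, scales):
    """Soft-assign a 3D point to bones (assumed isotropic Gaussians)."""
    logits = [
        -sum((p - c) ** 2 for p, c in zip(point, center)) / (2.0 * s * s)
        for center, s in zip(centers, scales)
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend_translations(point, centers, scales, translations):
    """Move a point by the skinning-weighted average of per-bone translations.

    Translation-only bones are a simplifying assumption made for brevity.
    """
    weights = skinning_weights(point, centers, scales)
    return tuple(
        p + sum(w * t[axis] for w, t in zip(weights, translations))
        for axis, p in enumerate(point)
    )


# Two hypothetical bones: one at the origin, one 10 units along x.
centers = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
scales = [1.0, 1.0]
translations = [(0.0, 1.0, 0.0), (0.0, 0.0, 5.0)]

# A point halfway between the bones receives an equal mix of both motions.
midpoint = blend_translations((5.0, 0.0, 0.0), centers, scales, translations)
```

A point close to one bone follows that bone almost exclusively, while points between bones are moved by a smooth mixture. This is the property that lets a small set of bones describe freeform deformation.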

Paper Structure

This paper contains 17 sections, 13 equations, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: Given an input video of a human, DressRecon reconstructs a time-consistent 4D body model, including shape, appearance, time-varying body articulations, as well as deformation of extremely loose clothing or accessory objects. We propose a hierarchical bag-of-bones deformation model that allows body and clothing motion to be separated. We leverage image-based priors such as human body pose, surface normals, and optical flow to make optimization more tractable. The resulting neural fields can be extracted into time-consistent meshes, or further optimized as explicit 3D Gaussians for high-fidelity interactive rendering.
  • Figure 2: Method Overview: We represent 3D humans in loose clothing as temporally consistent 4D neural fields (Sec. \ref{sec: preliminary}). Central to our approach is a flexible motion representation that captures fine-grained clothing deformations as well as limb motions, while effectively utilizing domain-specific priors such as 3D human body pose (Sec. \ref{sec: motion}). We perform video-specific optimization that fits this model to dense image-based priors via differentiable rendering (Sec. \ref{sec: optimization}). After optimization, our neural implicit surface can be extracted into a time-consistent mesh via marching cubes, or converted into explicit 3D Gaussians for high-fidelity interactive rendering (Sec. \ref{sec: refinement}).
  • Figure 3: Visualization of two-layer deformation. The body and clothing deformation layers each contribute separate types of motion. In this sequence, the clothing Gaussians deform the woman's dress to be larger, while the body Gaussians move her right arm forward. During forward warping, we start from the canonical shape (left), and first apply the forward warp described by clothing Gaussians, then the forward warp described by body Gaussians. The same process happens in reverse during backward warping.
  • Figure 4: 3D reconstruction results on DNA-Rendering. We demonstrate DressRecon's ability to reconstruct challenging sequences with large cloth deformation. DressRecon's predictions align well with the image evidence, even in the presence of rapid clothing and object deformations. Vid2Avatar often outputs spurious shape artifacts and is unable to reconstruct challenging structures, such as the white cloth (row 2), brown brush (row 3), and detailed sleeves (row 4). BANMo and RAC produce hollow cellos in row 1, and tend to output over-smoothed surfaces for the other cases. ECON produces highly detailed textures, but it performs the worst numerically (Tab. \ref{tab:chamfer_dna_rendering}) as the outputs often have an incorrect overall shape (e.g. row 1). We encourage readers to view the video results on the supplementary webpage.
  • Figure 5: 3D reconstruction results on ActorsHQ. DressRecon is on par with Vid2Avatar for standard clothing (Rows 2 and 4), and higher fidelity than Vid2Avatar for loose clothing (Rows 1 and 3). Vid2Avatar's reconstructed skirts often contain shape artifacts. We attribute DressRecon's improved performance to its flexible shape and deformation representation, which is capable of representing non-standard geometry and deformation.
  • ...and 2 more figures
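
The warp ordering described in the Figure 3 caption (clothing layer first, then body layer, and the reverse for backward warping) can be sketched as a composition of two invertible maps. This is an illustration under stated assumptions, not the paper's model: each layer is replaced by a single rigid transform (rotation about the z-axis plus a translation), whereas the method blends many Gaussian bones per layer. `RigidZ`, `forward_warp`, and `backward_warp` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RigidZ:
    """Stand-in for one deformation layer: rotate about z, then translate."""

    angle: float
    translation: tuple

    def forward(self, p):
        c, s = math.cos(self.angle), math.sin(self.angle)
        x, y, z = p
        tx, ty, tz = self.translation
        return (c * x - s * y + tx, s * x + c * y + ty, z + tz)

    def backward(self, p):
        # Inverse of forward: undo the translation, then rotate by -angle.
        c, s = math.cos(self.angle), math.sin(self.angle)
        tx, ty, tz = self.translation
        x, y, z = p[0] - tx, p[1] - ty, p[2] - tz
        return (c * x + s * y, -s * x + c * y, z)


def forward_warp(p, clothing, body):
    """Canonical to posed: clothing layer first, then body layer."""
    return body.forward(clothing.forward(p))


def backward_warp(p, clothing, body):
    """Posed to canonical: undo the body layer first, then the clothing layer."""
    return clothing.backward(body.backward(p))


# Hypothetical layers: clothing shifts along x, body rotates 90 degrees about z.
clothing = RigidZ(angle=0.0, translation=(0.5, 0.0, 0.0))
body = RigidZ(angle=math.pi / 2, translation=(1.0, 0.0, 0.0))

canonical = (1.0, 0.0, 0.0)
posed = forward_warp(canonical, clothing, body)
recovered = backward_warp(posed, clothing, body)
```

Applying the layers in the opposite order generally gives a different posed point, so the backward warp has to invert the layers in reverse order to return to the canonical shape.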