WonderHuman: Hallucinating Unseen Parts in Dynamic 3D Human Reconstruction

Zilong Wang, Zhiyang Dou, Yuan Liu, Cheng Lin, Xiao Dong, Yunhui Guo, Chenxu Zhang, Xin Li, Wenping Wang, Xiaohu Guo

TL;DR

WonderHuman tackles dynamic human reconstruction from monocular video by hallucinating unseen parts with diffusion-model priors. It combines 3D Gaussian Splatting with Score Distillation Sampling (SDS) applied in both canonical and observation spaces (Dual-space Optimization), guided by a View Selection strategy and Pose Feature Injection to maintain pose-consistent fidelity. Stage I reconstructs the visible appearance; Stage II uses SDS-based diffusion priors to infer unseen regions, supervised by normal maps and reinforced by visibility-aware refinement, while a progressive training schedule balances canonical-space and observation-space learning. The method achieves state-of-the-art results on unseen parts across multiple benchmarks, delivers competitive rendering speed, and demonstrates robust occlusion handling, though it remains challenged by extreme occlusion and loose garments.

Abstract

In this paper, we present WonderHuman to reconstruct dynamic human avatars from a monocular video for high-fidelity novel view synthesis. Previous dynamic human avatar reconstruction methods typically require the input video to have full coverage of the observed human body. However, in daily practice, one typically has access to limited viewpoints, such as monocular front-view videos, making it a cumbersome task for previous methods to reconstruct the unseen parts of the human avatar. To tackle the issue, we present WonderHuman, which leverages 2D generative diffusion model priors to achieve high-quality, photorealistic reconstructions of dynamic human avatars from monocular videos, including accurate rendering of unseen body parts. Our approach introduces a Dual-Space Optimization technique, applying Score Distillation Sampling (SDS) in both canonical and observation spaces to ensure visual consistency and enhance realism in dynamic human reconstruction. Additionally, we present a View Selection strategy and Pose Feature Injection to enforce the consistency between SDS predictions and observed data, ensuring pose-dependent effects and higher fidelity in the reconstructed avatar. In the experiments, our method achieves SOTA performance in producing photorealistic renderings from the given monocular video, particularly for those challenging unseen parts. The project page and source code can be found at https://wyiguanw.github.io/WonderHuman/.
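For readers unfamiliar with Score Distillation Sampling, the gradient it contributes can be sketched in its commonly used form (as introduced in DreamFusion). The symbols below are assumptions chosen for illustration, not notation taken from this paper: \theta denotes the optimizable avatar parameters, x = g(\theta) a rendered image, \epsilon_\phi the diffusion model's noise predictor, y the conditioning signal, t the diffusion timestep, and w(t) a timestep weighting.

```latex
% Generic SDS gradient; the paper's exact formulation may differ.
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\epsilon_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x = g(\theta)

% Hypothetical dual-space combination: the weights \lambda_c and \lambda_o
% are illustrative placeholders, not values or symbols from the paper.
\mathcal{L}_{\mathrm{dual}}
  = \lambda_c\, \mathcal{L}_{\mathrm{SDS}}^{\mathrm{canonical}}
  + \lambda_o\, \mathcal{L}_{\mathrm{SDS}}^{\mathrm{observation}}
```

Under this reading, Dual-Space Optimization would apply the same kind of term to renders in both the canonical space and the observation space; how the two terms are actually weighted and scheduled is specified in the paper, not here.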

Paper Structure

This paper contains 47 sections, 19 equations, 16 figures, 6 tables.

Figures (16)

  • Figure 1: Overview of WonderHuman. (1) In Stage I, we reconstruct 3D Gaussians and appearances for visible human parts from partial-view videos. We start with optimizable feature vectors, called canonical features, that capture human geometry and appearance in a canonical space. Then, we use a Gaussian Decoder to predict Gaussian parameters and combine the Linear Blend Skinning (LBS) function with Gaussian Splatting to render the dynamic 3D human in the observation space. (2) In Stage II, we hallucinate the invisible parts of the avatar using a Dual-space Optimization technique. We render images of the human avatar from various novel viewpoints and apply an SDS loss to learn the unseen appearances. Additionally, a normal predictor is used to generate normal maps that guide geometry reconstruction, while View Selection and Pose Feature Injection strategies are employed to ensure consistent appearance fusion.
  • Figure 2: Left side: Dual-space Optimization (a) w/o Dual-space Optimization; (b) w/ canonical optimization only; (c) w/ Dual-space Optimization; Right side: Pose Feature Injection (d) ground truth; (e) w/o pose feature injection; (f) w/ pose feature injection.
  • Figure 3: View Selection based on visibility map (a) Seen view: Visible region (orange) covers more than 50% of the foreground region; (b) Unseen view: Invisible region (blue) covers more than 50% of the foreground region.
  • Figure 4: Qualitative comparison on four datasets. We compare the novel view synthesis quality with HumanNeRF [weng2022humannerf], Instant-NVR [instant_nvr], SplattingAvatar [shao2024splattingavatar], ExAvatar [moon2024exavatar], and GaussianAvatar [hu2023gaussianavatar].
  • Figure 5: Qualitative comparison on three datasets. We compare the novel view synthesis quality with GuessTheUnseen [lee2024gtu].
  • ...and 11 more figures
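
The View Selection rule described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `classify_view`, the boolean-mask inputs, and the handling of an exact 50% tie (which the caption does not cover) are all hypothetical.

```python
import numpy as np


def classify_view(visible_mask, foreground_mask, threshold=0.5):
    """Label a rendered view as "seen" or "unseen" from its visibility map.

    Hypothetical helper: both inputs are assumed to be arrays of the same
    shape, where foreground_mask marks avatar pixels and visible_mask marks
    pixels observed in the input video.
    """
    fg = np.asarray(foreground_mask, dtype=bool)
    fg_count = int(fg.sum())
    if fg_count == 0:
        raise ValueError("foreground mask is empty")

    # Fraction of foreground pixels that were visible in the input video.
    visible = np.asarray(visible_mask, dtype=bool) & fg
    visible_ratio = float(visible.sum()) / fg_count

    # Caption rule: visible region > 50% of the foreground means "seen".
    # A tie at exactly 50% is an assumption here and falls to "unseen".
    return "seen" if visible_ratio > threshold else "unseen"


# Usage: 12 of 16 foreground pixels are visible, a ratio of 0.75.
foreground = np.ones((4, 4), dtype=bool)
visible = np.zeros((4, 4), dtype=bool)
visible[:3] = True
print(classify_view(visible, foreground))  # prints "seen"
```

Measuring the ratio against the foreground rather than the whole image keeps the decision independent of how much of the frame the avatar occupies, which matches the caption's wording ("covers more than 50% of the foreground region").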