SOAR: Self-Occluded Avatar Recovery from a Single Video In the Wild

Zhuoyang Pan, Angjoo Kanazawa, Hang Gao

TL;DR

We introduce Self-Occluded Avatar Recovery (SOAR), a method for complete human reconstruction from partial observations where parts of the body are entirely unobserved. SOAR compares favorably with state-of-the-art reconstruction and generation methods and is on par with concurrent works.

Abstract

Self-occlusion is common when capturing people in the wild, where performers do not follow predefined motion scripts. This challenges existing monocular human reconstruction systems that assume full body visibility. We introduce Self-Occluded Avatar Recovery (SOAR), a method for complete human reconstruction from partial observations where parts of the body are entirely unobserved. SOAR leverages a structural normal prior and a generative diffusion prior to address this ill-posed reconstruction problem. For the structural normal prior, we model the human with a reposable surfel model with well-defined and easily readable shapes. For the generative diffusion prior, we perform an initial reconstruction and refine it using score distillation. On various benchmarks, we show that SOAR compares favorably with state-of-the-art reconstruction and generation methods and is on par with concurrent works. Additional video results and code are available at https://soar-avatar.github.io/.

Paper Structure

This paper contains 24 sections, 8 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Complete human reconstruction from partial observations in the wild. We present SOAR: Self-Occluded Avatar Recovery. Given a video of a moving human where parts of the body are entirely unobserved (left), SOAR recovers a photo-realistic avatar with complete texture and shape (right) by leveraging a structural human normal prior and a generative diffusion prior.
  • Figure 2: Relation to existing problems. Our problem requires combining human reconstruction from video frames and human generation for occluded regions.
  • Figure 3: System overview. Given an input video, we preprocess it to obtain frame-wise masks, front and back normals, SMPL-X parameters, and a video-level text prompt description (Section \ref{sec:preprocessing}). Our model consists of a canonical Gaussian surfel representation and an articulation representation (Section \ref{sec:avatar}). We perform an initial reconstruction while estimating occlusion, which produces a partially completed avatar due to the lack of observation (Section \ref{sec:reconstruction}); this avatar is then refined by generative diffusion priors (Section \ref{sec:generation}).
  • Figure 4: Qualitative results on the DNA-Rendering dataset. For each training view, we visualize the ground-truth novel view along with the predicted RGB rendering and normal map from different approaches. Our method recovers photo-realistic and geometrically plausible avatars compared to the baselines. For GART and GA, we read out their normals by depth gradient \cite{dai2024high, huang20242d, jiang2023gaussianshader}.
  • Figure 5: Comparison between our globally consistent avatar and an image-to-3D baseline. Our method fuses all observations from a video and allows natural reposing.
  • ...and 2 more figures
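
The Figure 4 caption mentions reading out normals by depth gradient for baselines that do not predict normals directly. As a rough illustration of that general idea, and not the authors' implementation, the sketch below estimates per-pixel normals from a depth map with finite differences. It assumes an orthographic camera looking along +z; the function name `normals_from_depth`, the `pixel_size` parameter, and the sign convention of the normal are all hypothetical choices made for this example.

```python
import numpy as np


def normals_from_depth(depth, pixel_size=1.0):
    """Estimate unit normals from a depth map via finite differences.

    Assumption: orthographic camera looking along +z, with `pixel_size`
    the metric spacing between neighbouring pixels (hypothetical parameter).
    """
    depth = np.asarray(depth, dtype=np.float64)
    # np.gradient returns the derivative along axis 0 (rows, y)
    # and then along axis 1 (columns, x).
    dz_dy, dz_dx = np.gradient(depth, pixel_size)
    # The surface z = d(x, y) has a normal proportional to (-dz/dx, -dz/dy, 1).
    # Flipping the sign gives the opposite-facing convention.
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)


if __name__ == "__main__":
    # A plane tilted along x: depth increases by one unit per pixel column.
    ramp = np.tile(np.arange(5.0), (4, 1))
    print(normals_from_depth(ramp)[0, 0])
```

With a perspective camera, the pixels would first need to be back-projected to 3D points using the camera intrinsics before differencing; that step is omitted here.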