AHOY! Animatable Humans under Occlusion from YouTube Videos with Gaussian Splatting and Video Diffusion Priors

Aymen Mir, Riza Alp Guler, Xiangjun Tang, Peter Wonka, Gerard Pons-Moll

Abstract

We present AHOY, a method for reconstructing complete, animatable 3D Gaussian avatars from in-the-wild monocular video despite heavy occlusion. Existing methods assume unoccluded input (a fully visible subject, often in a canonical pose), which excludes the vast majority of real-world footage, where people are routinely occluded by furniture, objects, or other people. Reconstructing from such footage poses fundamental challenges: large body regions may never be observed, and multi-view supervision per pose is unavailable. We address these challenges with four contributions: (i) a hallucination-as-supervision pipeline that uses identity-finetuned diffusion models to generate dense supervision for previously unobserved body regions; (ii) a two-stage canonical-to-pose-dependent architecture that bootstraps from sparse observations to full pose-dependent Gaussian maps; (iii) a map-pose/LBS-pose decoupling that absorbs multi-view inconsistencies from the generated data; (iv) a head/body split supervision strategy that preserves facial identity. We evaluate on YouTube videos and on multi-view capture data with significant occlusion and demonstrate state-of-the-art reconstruction quality. We also demonstrate that the resulting avatars are robust enough to be animated with novel poses and composited into 3DGS scenes captured using cell-phone video. Our project page is available at https://miraymen.github.io/ahoy/

Paper Structure (25 sections, 8 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Given a monocular YouTube video with occlusion, AHOY reconstructs a complete, animatable 3D human avatar using video diffusion priors and 3D Gaussian Splatting, enabling photorealistic human animation within 3D scenes.
  • Figure 2: Method overview. Block 1 (Sec. \ref{sec:coarse_avatar}): We map observed textures from partially occluded video onto a canonical pose via DensePose UV correspondences, inpaint missing regions with FLUX, and generate multi-view canonical-pose images with multi-view diffusion to supervise a coarse 3DGS avatar using canonical Gaussian maps. Block 2 (Sec. \ref{sec:hallucinated_supervision}): We finetune a video diffusion model (Wan 2.2) with LoRA to capture the subject's identity, render the coarse avatar under structured motion sequences, and refine these renderings via RF-Inversion through the identity-finetuned latent space to produce hallucinated supervision videos. Block 3 (Sec. \ref{sec:full_avatar}): The hallucinated videos supervise a full avatar with pose-dependent Gaussian maps, where per-frame poses and cameras are jointly optimized to absorb multi-view inconsistencies; a separate FLAME-based head path preserves facial identity. Block 4 (Sec. \ref{sec:animation}): At inference, the avatar is driven by novel poses and composited into 3DGS scenes.
  • Figure 3: Animation comparison (Zoom In). Top: canonical-pose input. Bottom (below line): occluded input. Ours produces higher-fidelity avatars in both settings.
  • Figure 4: Static reconstruction comparison (Zoom In). Top: canonical-pose input. Bottom: occluded input. Our animatable avatar is posed to match the target view.
  • Figure 5: Reconstruction under difficult fully visible poses. Static methods struggle to reconstruct humans under difficult poses, while our method allows reposing of 3DGS avatars to match the target pose.
  • ...and 3 more figures
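
The first step of Block 1 in the Figure 2 caption (mapping observed textures into a canonical layout through UV correspondences, then inpainting what is missing) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single flat UV atlas with u and v in [0, 1], whereas DensePose predicts per-part charts, and the names `scatter_to_uv_texture` and `missing_fraction` are hypothetical. The sketch only accumulates observed colors into texels and reports which texels would still need inpainting.

```python
from typing import List, Optional, Tuple

Color = Tuple[int, int, int]


def scatter_to_uv_texture(
    samples: List[Tuple[float, float, Color]],
    size: int,
) -> Tuple[List[List[Optional[Color]]], List[List[bool]]]:
    """Scatter (u, v, color) samples into a size x size texture.

    Assumption: u and v lie in [0, 1] on one shared atlas. Returns the
    texture (None for unobserved texels) and a boolean observed mask.
    """
    sums = [[[0, 0, 0] for _ in range(size)] for _ in range(size)]
    counts = [[0] * size for _ in range(size)]

    for u, v, color in samples:
        # Clamp so that u == 1.0 or v == 1.0 falls into the last texel.
        col = max(0, min(int(u * size), size - 1))
        row = max(0, min(int(v * size), size - 1))
        for c in range(3):
            sums[row][col][c] += color[c]
        counts[row][col] += 1

    texture: List[List[Optional[Color]]] = [[None] * size for _ in range(size)]
    observed = [[False] * size for _ in range(size)]
    for row in range(size):
        for col in range(size):
            n = counts[row][col]
            if n:
                # Average all samples that landed on this texel.
                r, g, b = (s // n for s in sums[row][col])
                texture[row][col] = (r, g, b)
                observed[row][col] = True
    return texture, observed


def missing_fraction(observed: List[List[bool]]) -> float:
    """Fraction of texels with no observation (candidates for inpainting)."""
    total = sum(len(row) for row in observed)
    seen = sum(1 for row in observed for flag in row if flag)
    return (total - seen) / total


# Two samples land on texel (0, 0); one lands on texel (1, 1).
samples = [
    (0.0, 0.0, (10, 20, 30)),
    (0.1, 0.1, (30, 40, 50)),
    (1.0, 1.0, (255, 0, 0)),
]
texture, observed = scatter_to_uv_texture(samples, size=2)
```

Under heavy occlusion the observed mask would be sparse, which is the situation the caption's inpainting and multi-view diffusion steps are described as handling; those steps are outside the scope of this sketch.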