Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior

Chen Guo, Junxuan Li, Yash Kant, Yaser Sheikh, Shunsuke Saito, Chen Cao

TL;DR

Vid2Avatar-Pro addresses the challenge of generating photorealistic, animatable 3D avatars from monocular in-the-wild videos. It introduces a universal prior model (UPM) for clothed humans, built from thousands of dynamic multi-view captures using 3D Gaussians with front/back identity conditioning to support cross-identity learning. The approach personalizes the UPM to a given monocular video via inverse rendering, diffusion-based texture inpainting, and targeted fine-tuning, which recovers person-specific details and allows the avatar to be animated in novel poses. Experiments on NeuMan and MonoPerfCap demonstrate state-of-the-art performance in interpolation and extrapolation, with ablations validating key components such as skeleton-based normalization, texture inpainting, and fine-tuning. The method broadens access to realistic AR/VR avatars by enabling high-quality, controllable avatars from single videos, though it currently omits facial expressions and loose clothing dynamics under standard lighting.

Abstract

We present Vid2Avatar-Pro, a method to create photorealistic and animatable 3D human avatars from monocular in-the-wild videos. Building a high-quality avatar that supports animation with diverse poses from a monocular video is challenging because the observation of pose diversity and viewpoints is inherently limited. The lack of pose variations typically leads to poor generalization to novel poses, and avatars can easily overfit to limited input viewpoints, producing artifacts and distortions from other views. In this work, we address these limitations by leveraging a universal prior model (UPM) learned from a large corpus of multi-view clothed human performance capture data. We build our representation on top of expressive 3D Gaussians with canonical front and back maps shared across identities. Once the UPM is learned to accurately reproduce the large-scale multi-view human images, we fine-tune the model with an in-the-wild video via inverse rendering to obtain a personalized photorealistic human avatar that can be faithfully animated to novel human motions and rendered from novel views. The experiments show that our approach based on the learned universal prior sets a new state of the art in monocular avatar reconstruction by substantially outperforming existing approaches relying only on heuristic regularization or a shape prior of minimally clothed bodies (e.g., SMPL) on publicly available datasets.

Paper Structure

This paper contains 30 sections, 7 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Method Overview. a) We employ a large corpus of multi-view dynamic clothed human performances to train a cross-identity universal prior model (UPM). During training, the UPM is conditioned on the normalized identity-specific texture map $\mathcal{T}_c$ and takes the posed position map $\mathcal{P}_d(\boldsymbol{\Theta})$ as input to predict Gaussian attributes. We extract the canonical 3D Gaussians and synthesize human renderings for the training pose/shape parameters $\boldsymbol{\Theta}$ by applying forward LBS and rasterization. We minimize the loss $\mathcal{L}$ over the entire universal human corpus. b) Given a monocular in-the-wild video of an unseen identity, we track the human pose/shape parameters $\boldsymbol{\Theta}$ and reconstruct the canonical textured template. We further deploy a diffusion-based model tailored for canonical texture inpainting to complete the canonical texture map. We then fine-tune our pre-trained UPM on the monocular observations via inverse rendering to recover person-specific details.
  • Figure 2: Qualitative interpolation synthesis comparisons on the NeuMan dataset. Baseline methods tend to render with artifacts (e.g., corrupted faces and feet) and fewer details (e.g., the clothing wrinkles and the clothes zipper). In contrast, our method generates clean and realistic human renderings while recovering more appearance details (e.g., facial features).
  • Figure 3: Qualitative extrapolation synthesis comparisons on the MonoPerfCap dataset. In this more challenging setting, our method produces more identity-preserving human renderings with finer-grained appearance details, e.g., the facial features and shoes. Powered by our universal prior model, Vid2Avatar-Pro generates more plausible pose-dependent surface deformations (cf. the hem of the T-shirt).
  • Figure 4: Importance of skeleton-based normalization. Without skeleton-based normalization of the conditioning data, the created avatar tends to produce artifacts when driven with out-of-distribution poses (e.g., the holes close to the armpit).
  • Figure 5: Importance of inpainting. Our diffusion-based inpainting module can effectively complete the textures that are missing from the monocular observations.
  • ...and 2 more figures
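
The Figure 1 caption mentions that canonical 3D Gaussians are posed with forward LBS (linear blend skinning) before rasterization. The following is a minimal NumPy sketch of generic forward LBS applied to point positions only, not the authors' implementation. The function name `forward_lbs`, the array layouts, and the toy two-joint skeleton are assumptions made for illustration; a full Gaussian avatar would also need to transform each Gaussian's rotation and covariance, which is omitted here.

```python
import numpy as np


def forward_lbs(points_c, weights, bone_transforms):
    """Warp canonical points into posed space with linear blend skinning.

    points_c:        (N, 3) canonical positions (e.g. Gaussian centers)
    weights:         (N, J) skinning weights, each row summing to 1
    bone_transforms: (J, 4, 4) canonical-to-posed transform per joint
    """
    # Blend the per-joint transforms for every point: (N, 4, 4).
    blended = np.einsum("nj,jab->nab", weights, bone_transforms)
    # Lift points to homogeneous coordinates: (N, 4).
    ones = np.ones((points_c.shape[0], 1))
    homo = np.concatenate([points_c, ones], axis=1)
    # Apply each point's blended transform and drop the homogeneous 1.
    posed = np.einsum("nab,nb->na", blended, homo)
    return posed[:, :3]


# Toy example (hypothetical): joint 0 stays fixed, joint 1 moves up by 2 units.
transforms = np.stack([np.eye(4), np.eye(4)])
transforms[1, :3, 3] = [0.0, 2.0, 0.0]

points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])

posed = forward_lbs(points, weights, transforms)
print(posed)
```

In this toy setup the first point follows joint 0 and does not move, the second follows joint 1 and is lifted by 2 units, and the third is bound equally to both joints and is lifted by 1 unit, which is the blending behavior that lets a single canonical representation be driven by arbitrary pose parameters.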