Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior
Chen Guo, Junxuan Li, Yash Kant, Yaser Sheikh, Shunsuke Saito, Chen Cao
TL;DR
Vid2Avatar-Pro addresses the challenge of generating photorealistic, animatable 3D avatars from monocular in-the-wild videos. It introduces a universal prior model (UPM) for clothed humans, trained on thousands of dynamic multi-view captures and represented with 3D Gaussians that use front and back identity conditioning to support cross-identity learning. The approach personalizes the UPM to a given monocular video via inverse rendering, diffusion-based texture inpainting, and targeted fine-tuning, which recovers person-specific details and allows the avatar to be animated with novel poses. Experiments on NeuMan and MonoPerfCap demonstrate state-of-the-art performance in both interpolation and extrapolation settings, and ablations validate key components such as skeleton-based normalization, texture inpainting, and fine-tuning. The method broadens access to realistic AR/VR avatars by producing high-quality, controllable avatars from a single video, though it does not currently model facial expressions or loose-clothing dynamics and assumes standard lighting.
Abstract
We present Vid2Avatar-Pro, a method to create photorealistic and animatable 3D human avatars from monocular in-the-wild videos. Building a high-quality avatar that supports animation with diverse poses from a monocular video is challenging because the observed diversity of poses and viewpoints is inherently limited. The lack of pose variation typically leads to poor generalization to novel poses, and avatars can easily overfit to the limited input viewpoints, producing artifacts and distortions when rendered from other views. In this work, we address these limitations by leveraging a universal prior model (UPM) learned from a large corpus of multi-view clothed human performance capture data. We build our representation on top of expressive 3D Gaussians with canonical front and back maps shared across identities. Once the UPM has learned to accurately reproduce the large-scale multi-view human images, we fine-tune the model on an in-the-wild video via inverse rendering to obtain a personalized photorealistic human avatar that can be faithfully animated with novel human motions and rendered from novel views. Experiments on publicly available datasets show that our approach, based on the learned universal prior, sets a new state of the art in monocular avatar reconstruction, substantially outperforming existing approaches that rely only on heuristic regularization or on a shape prior of minimally clothed bodies (e.g., SMPL).
