EVA: Expressive Virtual Avatars from Multi-view Videos
Hendrik Junkawitsch, Guoxing Sun, Heming Zhu, Christian Theobalt, Marc Habermann
TL;DR
EVA tackles the challenge of creating photorealistic, fully controllable human avatars from multi-view video by decomposing the problem into a deformable expressive template and a disentangled Gaussian appearance layer. It jointly learns an expressive template mesh $\boldsymbol{\Phi}_{\mathrm{mesh}}$ and an appearance model $\boldsymbol{\Phi}_{\mathrm{app}}$, with separate body and head Gaussians driven by pose $\boldsymbol{\theta}$ and expression $\boldsymbol{\psi}$ inputs, enabling independent control of body, hands, and face. The method combines motion optimization, multi-stage head fitting, and dual-branch Gaussian prediction trained on dense multi-view data, achieving real-time rendering (over 30 fps) and superior rendering quality compared to state-of-the-art baselines such as DDC and ASH. By handling loose clothing through learned clothing dynamics and incorporating a global lighting latent $\Psi$, EVA supports relighting and realistic appearance under varied studio conditions. This work advances toward fully drivable digital humans for XR, telepresence, and virtual production, while acknowledging limitations in topology changes and lighting modeling that future work could address.
Abstract
With recent advancements in neural rendering and motion capture algorithms, remarkable progress has been made in photorealistic human avatar modeling, unlocking immense potential for applications in virtual reality, augmented reality, remote communication, and industries such as gaming, film, and medicine. However, existing methods fail to provide complete, faithful, and expressive control over human avatars due to their entangled representation of facial expressions and body movements. In this work, we introduce Expressive Virtual Avatars (EVA), an actor-specific, fully controllable, and expressive human avatar framework that achieves high-fidelity, lifelike renderings in real time while enabling independent control of facial expressions, body movements, and hand gestures. Specifically, our approach designs the human avatar as a two-layer model: an expressive template geometry layer and a 3D Gaussian appearance layer. First, we present an expressive template tracking algorithm that leverages coarse-to-fine optimization to accurately recover body motions, facial expressions, and non-rigid deformation parameters from multi-view videos. Next, we propose a novel decoupled 3D Gaussian appearance model designed to effectively disentangle body and facial appearance. Unlike unified Gaussian estimation approaches, our method employs two specialized and independent modules to model the body and face separately. Experimental results demonstrate that EVA surpasses state-of-the-art methods in terms of rendering quality and expressiveness, validating its effectiveness in creating full-body avatars. This work represents a significant advancement towards fully drivable digital human models, enabling the creation of lifelike digital avatars that faithfully replicate human geometry and appearance.
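The decoupled design described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the two learned modules are replaced by fixed random linear maps, the Gaussians are reduced to 3D positions only, and the skeletal transform that would place both sets on the posed template is omitted. All names and sizes (`body_gaussians`, `head_gaussians`, `N_BODY`, `POSE_DIM`, and so on) are hypothetical. The sketch shows only the structural point: body Gaussians depend on the pose input $\boldsymbol{\theta}$, head Gaussians depend on the expression input $\boldsymbol{\psi}$, so each can be changed without affecting the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration (not taken from the paper).
N_BODY, N_HEAD = 6, 4        # number of body / head Gaussians
POSE_DIM, EXPR_DIM = 5, 3    # sizes of theta (pose) and psi (expression)

# Stand-ins for the two independent learned modules: fixed random linear maps.
W_body = rng.normal(size=(N_BODY * 3, POSE_DIM))
W_head = rng.normal(size=(N_HEAD * 3, EXPR_DIM))

# Rest positions of the Gaussians anchored on the template (assumed given).
body_anchor = rng.normal(size=(N_BODY, 3))
head_anchor = rng.normal(size=(N_HEAD, 3))


def body_gaussians(theta):
    """Body branch: Gaussian positions depend only on the pose input."""
    return body_anchor + 0.01 * (W_body @ theta).reshape(N_BODY, 3)


def head_gaussians(psi):
    """Head branch: Gaussian positions depend only on the expression input."""
    return head_anchor + 0.01 * (W_head @ psi).reshape(N_HEAD, 3)


def avatar_gaussians(theta, psi):
    """Concatenate both branches; body rows come first, head rows last."""
    return np.concatenate([body_gaussians(theta), head_gaussians(psi)], axis=0)


theta = np.ones(POSE_DIM)
neutral = avatar_gaussians(theta, np.zeros(EXPR_DIM))
smiling = avatar_gaussians(theta, np.ones(EXPR_DIM))

# Changing only the expression leaves the body rows untouched.
print(np.array_equal(neutral[:N_BODY], smiling[:N_BODY]))
```

In a unified model, a single network would map the pair $(\boldsymbol{\theta}, \boldsymbol{\psi})$ to all Gaussians at once, so nothing in the structure would prevent an expression change from altering the body output. Splitting the model into two branches with separate inputs makes that independence hold by construction.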
