EVA: Expressive Virtual Avatars from Multi-view Videos

Hendrik Junkawitsch, Guoxing Sun, Heming Zhu, Christian Theobalt, Marc Habermann

TL;DR

EVA tackles the challenge of creating photorealistic, fully controllable human avatars from multi-view video by decomposing the problem into a deformable expressive template and a disentangled Gaussian appearance layer. It jointly learns an expressive template mesh $\boldsymbol{\Phi}_{\mathrm{mesh}}$ and an appearance model $\boldsymbol{\Phi}_{\mathrm{app}}$, with separate body and head Gaussians driven by pose $\boldsymbol{\theta}$ and expression $\boldsymbol{\psi}$ inputs, enabling independent control of body, hands, and face. The method combines motion optimization, multi-stage head fitting, and dual-branch Gaussian prediction trained on dense multi-view data, achieving real-time rendering (over 30 fps) and superior rendering quality compared to state-of-the-art baselines such as DDC and ASH. By handling loose clothing through learned clothing dynamics and incorporating a global lighting latent $\Psi$, EVA supports relighting and realistic appearance under varied studio conditions. This work advances toward fully drivable digital humans for XR, telepresence, and virtual production, while acknowledging limitations in topology changes and lighting modeling that future work could address.

Abstract

With recent advancements in neural rendering and motion capture algorithms, remarkable progress has been made in photorealistic human avatar modeling, unlocking immense potential for applications in virtual reality, augmented reality, remote communication, and industries such as gaming, film, and medicine. However, existing methods fail to provide complete, faithful, and expressive control over human avatars due to their entangled representation of facial expressions and body movements. In this work, we introduce Expressive Virtual Avatars (EVA), an actor-specific, fully controllable, and expressive human avatar framework that achieves high-fidelity, lifelike renderings in real time while enabling independent control of facial expressions, body movements, and hand gestures. Specifically, our approach designs the human avatar as a two-layer model: an expressive template geometry layer and a 3D Gaussian appearance layer. First, we present an expressive template tracking algorithm that leverages coarse-to-fine optimization to accurately recover body motions, facial expressions, and non-rigid deformation parameters from multi-view videos. Next, we propose a novel decoupled 3D Gaussian appearance model designed to effectively disentangle body and facial appearance. Unlike unified Gaussian estimation approaches, our method employs two specialized and independent modules to model the body and face separately. Experimental results demonstrate that EVA surpasses state-of-the-art methods in terms of rendering quality and expressiveness, validating its effectiveness in creating full-body avatars. This work represents a significant advancement towards fully drivable digital human models, enabling the creation of lifelike digital avatars that faithfully replicate human geometry and appearance.

Paper Structure

This paper contains 56 sections, 20 equations, 20 figures, 4 tables.

Figures (20)

  • Figure 1: Method Overview. EVA generates high-fidelity renderings from a virtual viewpoint, skeletal motion, and expression parameters. Using a personalized head avatar and a deformable character model, we control body movements and facial expressions to drive an actor-specific mesh. This mesh generates motion-aware textures, and separate modules independently predict the Gaussian parameters for the face and body. The 3D Gaussians are combined, UV-mapped, and warped from canonical to posed space via dual quaternion skinning. Finally, they are splatted to render the final photorealistic image.
  • Figure 2: Illustration of the head fitting: (b-d) show the results after each iteration of our freeze-and-refine optimization strategy, with (d) showing the optimized FLAME model in a neutral expression. Black regions indicate the areas optimized in each iteration. (e) displays the added vertex displacements with their corresponding fitting weights, where green indicates high weights and red indicates low weights.
  • Figure 3: Qualitative result and comparison of our expressive mesh template and appearance (green box) with the ground truth image (GT), alongside results from a character animated using only an underlying skeleton with simple forward kinematics (FK) and dual quaternion skinning (DQS) [Kavan2007], as well as the template mesh employed by ASH [Pang_2024_CVPR], which closely resembles the standard DDC [habermann2021DDC] mesh but includes control over the hands. Our approach not only controls body pose and hand gestures but also defines facial expressions. Additionally, we showcase the outcomes of our motion optimization strategy, which significantly enhances the alignment between our expressive template mesh and the underlying multi-view images.
  • Figure 4: Qualitative results of the complete EVA model. The first set of results demonstrates EVA's ability to render a character with previously observed motions and expressions from novel viewpoints. The second set showcases EVA's performance in scenarios combining novel viewpoints, unseen motions, and new expressions. Notably, we illustrate EVA's capability to modify expressions independently of body appearance, enabled by the explicitly modeled disentanglement between body and head, which is a key contribution of our appearance model. EVA renders high-fidelity images of humans, capturing motion-dependent details while robustly handling unseen expression parameters.
  • Figure 5: Qualitative comparison of EVA with two real-time human rendering approaches: ASH [Pang_2024_CVPR] and DDC [habermann2021DDC]. By conditioning our appearance model on the character's facial expression, EVA successfully predicts high-fidelity facial appearance in novel viewpoints, as well as for unseen motions and expressions. Additionally, our enhanced training strategy, supplementary regularization techniques, and the introduction of the IDMRF loss significantly improve both the quality of facial rendering and the overall accuracy of the generated images, outperforming ASH and DDC in terms of rendering fidelity.
  • ...and 15 more figures
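
The Figure 1 caption mentions that the Gaussians are warped from canonical to posed space via dual quaternion skinning. As a rough illustration of that general technique (not the authors' implementation), the following NumPy sketch blends per-bone rigid transforms as dual quaternions and applies the result to a single canonical point. All function names, the (w, x, y, z) quaternion order, and the per-point loop are assumptions made for this example; a real-time system would run a batched GPU version.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])

def quat_conj(q):
    """Conjugate of a quaternion (inverse for unit quaternions)."""
    return np.array([q[0], -q[1], -q[2], -q[3]])

def to_dual_quat(q_rot, t):
    """Pack a rotation quaternion and a translation into (real, dual) parts."""
    q_rot = np.asarray(q_rot, dtype=float)
    q_rot = q_rot / np.linalg.norm(q_rot)
    q_dual = 0.5 * quat_mul(np.array([0.0, *t]), q_rot)
    return q_rot, q_dual

def rotate(q, p):
    """Rotate point p by unit quaternion q."""
    return quat_mul(quat_mul(q, np.array([0.0, *p])), quat_conj(q))[1:]

def dqs_point(p, dual_quats, weights):
    """Warp canonical point p with a weighted blend of bone dual quaternions."""
    ref = dual_quats[0][0]
    real = np.zeros(4)
    dual = np.zeros(4)
    for (q_r, q_d), w in zip(dual_quats, weights):
        # Flip sign so all rotations lie in the same hemisphere as the first bone.
        sign = -1.0 if np.dot(q_r, ref) < 0.0 else 1.0
        real += sign * w * q_r
        dual += sign * w * q_d
    # Normalize by the norm of the real part to get a rigid transform again.
    norm = np.linalg.norm(real)
    real /= norm
    dual /= norm
    # Recover the translation from the blended dual quaternion.
    t = 2.0 * quat_mul(dual, quat_conj(real))[1:]
    return rotate(real, p) + t
```

In this sketch a point influenced by a single bone simply receives that bone's rigid transform, while a point shared between bones receives a blended transform that is still rigid, which is the usual motivation for choosing dual quaternion skinning over linear blending of matrices.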