Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation

Kaiwen Jiang, Xueting Li, Seonwook Park, Ravi Ramamoorthi, Shalini De Mello, Koki Nagano

TL;DR

This work presents a fast feed-forward pipeline that encodes a single image into a 3D Gaussian-based head avatar and animates it from a driving image. By distilling expressive priors from a 2D diffusion model into a per-Gaussian feature-space deformation framework, the method achieves strong 3D consistency and fine-grained expressions while running at over 100 FPS (107.31 FPS as reported in the abstract). The approach pairs each Gaussian with a lightweight motion basis vector and trains on data synthesized by the diffusion model, which avoids costly global fusion such as stacked attention layers. It reaches animation quality comparable to the state of the art while running far faster than diffusion-based 2D methods, enabling real-time digital-twin, telepresence, and AR/VR applications. It relies on synthetic distillation data and a 3D lifting backbone, either of which may introduce biases. The work highlights a practical path to real-time, expressive 4D avatars distilled from powerful diffusion priors, with potential extensions to lighting disentanglement and multi-modal driving.

Abstract

Portrait animation has witnessed tremendous quality improvements thanks to recent advances in video diffusion models. However, these 2D methods often compromise 3D consistency and speed, limiting their applicability in real-world scenarios, such as digital twins or telepresence. In contrast, 3D-aware facial animation feedforward methods -- built upon explicit 3D representations, such as neural radiance fields or Gaussian splatting -- ensure 3D consistency and achieve faster inference speed, but come with inferior expression details. In this paper, we aim to combine their strengths by distilling knowledge from a 2D diffusion-based method into a feed-forward encoder, which instantly converts an in-the-wild single image into a 3D-consistent, fast yet expressive animatable representation. Our animation representation is decoupled from the face's 3D representation and learns motion implicitly from data, eliminating the dependency on pre-defined parametric models that often constrain animation capabilities. Unlike previous computationally intensive global fusion mechanisms (e.g., multiple attention layers) for fusing 3D structural and animation information, our design employs an efficient lightweight local fusion strategy to achieve high animation expressivity. As a result, our method runs at 107.31 FPS for animation and pose control while achieving comparable animation quality to the state-of-the-art, surpassing alternative designs that trade speed for quality or vice versa. Project website is https://research.nvidia.com/labs/amri/projects/instant4d

Paper Structure

This paper contains 31 sections, 3 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: We present an instant feedforward encoder that transforms an in-the-wild source image into an animatable 3D avatar by distilling knowledge from a pre-trained 2D diffusion model. Our method introduces a fast, consistent yet expressive 3D animation representation. Given a driving image, we evaluate both the expression transfer quality and the animation speed (measured as "FPS" on an NVIDIA 6000 Ada GPU) against (a) 2D diffusion- or GAN-based methods and (b) 3D-aware methods. In the first row, Portrait4D-v2 [deng2024portrait4dv2], GAGAvatar [gagavatar], and VOODOO-XP [voodooxp] fail to faithfully transfer expressions, particularly around the nasal wrinkles. LivePortrait [guo2024liveportrait] is inaccurate around the eyes. In the second row, the baby wears a fake mustache as a decoration. X-NeMo distorts identity and adds a hallucinated mustache. The other methods do not handle the asymmetric expression in the driving image well. The FPS marked with $*$ reports inference time excluding the time-consuming morphable model fitting optimization required by the method. In contrast, ours not only accurately transfers expressions but also achieves high animation speed and consistent pose control. Insets show our rendered results under different poses.
  • Figure 2: We provide a visualization of the quantitative comparison in terms of 3D inconsistency (measured by MEt3R $\downarrow$), expression transfer inaccuracy (measured by AED $\downarrow$) and animation speed (measured by FPS $\uparrow$, visualized as the size of the circle) with other 2D- or 3D-based baselines [zhao2024invertavatar, deng2024portrait4dv2, voodooxp, xu2025hunyuanportrait, gagavatar, guo2024liveportrait, zhao2025x], using the task of cross-reenactment. 2D methods tend to appear on the upper left (better expression transfer accuracy; worse 3D consistency) while 3D methods tend to appear on the lower right (worse expression transfer accuracy; better 3D consistency). Our method is 3-4 orders of magnitude faster than diffusion-based models [xu2025hunyuanportrait, zhao2025x] while simultaneously achieving better 3D consistency and expression transfer accuracy.
  • Figure 3: Overview of our training pipeline with the two-part self-reenactment task. (a) Reconstruction: Given a frontalized source frame with an expression synthesized by a pre-trained diffusion model [zhao2025x], we first use a multi-view estimator [lp3d] to generate an image of it from another viewpoint, $I_s$. The encoder $E$ converts $I_s$ into triplanes, from which we sample feature vectors $f_1,f_2,\ldots$ and paired motion basis vectors $m_1,m_2,\ldots$. A Gaussian decoder $\Phi$ maps these features into a set of 3D Gaussians, forming a lifted 3D avatar for $I_s$, which we render at the viewpoint of $I_s$ as $I_s'$. (b) Animation: For the synthesized driving frame of the same identity but with a different expression, we similarly obtain an image of it from another viewpoint, $I_d$. Both $I_s$ and $I_d$ are input into the motion encoder $M$ to produce motion coefficients $M(I_s)$ and $M(I_d)$. These are concatenated to condition a motion decoder $\Psi$, which predicts residual features $\delta f_1, \delta f_2,\ldots$ from the paired motion basis vectors. Adding these residuals to the original features and decoding them with $\Phi$ yields an animated set of Gaussians, which we render at the viewpoint of $I_d$ as $I_d'$. The loss is computed between $(I_s, I_s')$ and $(I_d, I_d')$. Fire icons denote trainable modules; snow icons denote frozen pre-trained modules.
  • Figure 4: Conceptual comparison between predicting residual features per Gaussian versus per grid point on the triplanes [deng2024portrait4dv2, voodooxp], using the expression of opening the mouth as an example. (a) In our framework, the 3D Gaussian can be transformed independently from the red point to the blue point because its motion basis vector encodes all necessary motion information. (b) In contrast, existing triplane-based works must aggregate dense global context to update the features at each grid point. For example, they need to fuse shape information from the global context through the attention mechanism to decide whether the mouth will reach the red point, and therefore whether to update the geometry there.
  • Figure 5: Demonstration of the similarity among motion basis vectors within and across subjects. Given the source images, we render the motion basis vectors of their Gaussian kernels via splatting. For the first-row subject, we select three specific points (red points) and compute the cosine similarity between their motion basis vectors and those of all other locations. We then compute the cosine similarity across subjects, between these same points of the first subject and all motion basis vectors of the second-row subject. The resulting similarity maps show that our model learns coherent, semantically meaningful, and localized motion basis vectors.
  • ...and 3 more figures
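
The animation step described in the Figure 3 and Figure 4 captions can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the dimensions, the two-layer MLP standing in for $\Psi$, the random weights, and the choice to condition $\Psi$ by concatenating the motion coefficients onto each motion basis vector are all assumptions made for illustration. The Gaussian decoder $\Phi$ and the renderer are omitted. What the sketch does show is the locality property from Figure 4(a): the residual $\delta f_i$ of a Gaussian depends only on its own $m_i$ and the shared motion coefficients, so no attention over other Gaussians is needed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: number of Gaussians, feature dim, motion-basis dim,
# motion-coefficient dim, hidden width. None of these come from the paper.
N, F, B, C, H = 1024, 32, 16, 8, 64

def make_mlp(in_dim, hidden, out_dim, rng):
    """Random two-layer MLP weights standing in for a trained decoder Psi."""
    return {
        "W1": rng.normal(0.0, 0.1, (in_dim, hidden)), "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, out_dim)), "b2": np.zeros(out_dim),
    }

def mlp(params, x):
    """Apply the MLP row-wise: each row of x is processed independently."""
    h = np.maximum(x @ params["W1"] + params["b1"], 0.0)  # ReLU
    return h @ params["W2"] + params["b2"]

def animate_features(f, m, coeff_src, coeff_drv, psi):
    """Return f_i + delta_f_i for every Gaussian.

    f: (N, F) per-Gaussian features, m: (N, B) paired motion basis vectors,
    coeff_src / coeff_drv: (C,) motion coefficients M(I_s) and M(I_d).
    """
    cond = np.concatenate([coeff_src, coeff_drv])           # (2C,) shared condition
    cond = np.broadcast_to(cond, (m.shape[0], cond.size))   # repeat for each Gaussian
    delta_f = mlp(psi, np.concatenate([m, cond], axis=1))   # (N, F) residuals
    return f + delta_f

# Stand-in inputs; in the described pipeline these come from E and M.
f = rng.normal(size=(N, F))
m = rng.normal(size=(N, B))
c_src = rng.normal(size=C)
c_drv = rng.normal(size=C)
psi = make_mlp(B + 2 * C, H, F, rng)

f_anim = animate_features(f, m, c_src, c_drv, psi)  # would be decoded by Phi
```

Because every row is processed independently, the cost of this step grows linearly with the number of Gaussians, which is the intuition behind the local fusion strategy contrasted with global attention in Figure 4(b).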
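
The similarity maps described in the Figure 5 caption can be computed along the following lines. This is a sketch under assumptions: the motion basis vectors are taken to be already rendered into an image-shaped array of shape (height, width, channels), the array contents here are random placeholders, and the function name `cosine_similarity_map` is hypothetical.

```python
import numpy as np

def cosine_similarity_map(query, basis_map, eps=1e-8):
    """Cosine similarity between one query vector and every pixel's vector.

    query: (B,) motion basis vector at a selected point.
    basis_map: (H, W, B) rendered motion basis vectors.
    Returns an (H, W) map with values in [-1, 1].
    """
    q = query / (np.linalg.norm(query) + eps)
    b = basis_map / (np.linalg.norm(basis_map, axis=-1, keepdims=True) + eps)
    return b @ q

rng = np.random.default_rng(0)
height, width, channels = 64, 64, 16                     # hypothetical sizes
basis_a = rng.normal(size=(height, width, channels))     # placeholder: subject 1
basis_b = rng.normal(size=(height, width, channels))     # placeholder: subject 2

point = (10, 20)                                         # a selected "red point"
within = cosine_similarity_map(basis_a[point], basis_a)  # same subject
across = cosine_similarity_map(basis_a[point], basis_b)  # across subjects
```

With learned motion basis vectors in place of the random placeholders, `within` and `across` correspond to the within-subject and cross-subject maps that the caption describes.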