Retrieval-Augmented Gaussian Avatars: Improving Expression Generalization

Matan Levy, Gavriel Habib, Issar Tzachor, Dvir Samuel, Rami Ben-Ari, Nir Darshan, Or Litany, Dani Lischinski

TL;DR

We introduce RAF (Retrieval-Augmented Faces), a simple training-time augmentation for template-free head avatars that learn deformation from data. Experiments show that RAF consistently improves expression fidelity over the baseline in both self-driving and cross-driving scenarios.

Abstract

Template-free animatable head avatars can achieve high visual fidelity by learning expression-dependent facial deformation directly from a subject's capture, avoiding parametric face templates and hand-designed blendshape spaces. However, since learned deformation is supervised only by the expressions observed for a single identity, these models suffer from limited expression coverage and often struggle when driven by motions that deviate from the training distribution. We introduce RAF (Retrieval-Augmented Faces), a simple training-time augmentation designed for template-free head avatars that learn deformation from data. RAF constructs a large unlabeled expression bank and, during training, replaces a subset of the subject's expression features with nearest-neighbor expressions retrieved from this bank while still reconstructing the subject's original frames. This exposes the deformation field to a broader range of expression conditions, encouraging stronger identity-expression decoupling and improving robustness to expression distribution shift without requiring paired cross-identity data, additional annotations, or architectural changes. We further analyze how retrieval augmentation increases expression diversity and validate retrieval quality with a user study showing that retrieved neighbors are perceptually closer in expression and pose. Experiments on the NeRSemble benchmark demonstrate that RAF consistently improves expression fidelity over the baseline, in both self-driving and cross-driving scenarios.

Paper Structure (24 sections, 2 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Cross-identity qualitative comparison. The avatars are taken from the NeRSemble dataset (Kirschstein et al., 2023). Compared to the baselines, RAF produces expressions that more closely resemble those of the driving subject.
  • Figure 2: PCA visualization of the expression bank. We project expression features into 2D with PCA. For three query expressions (stars), we highlight their $10$ nearest neighbors computed in the original BFM expression-feature space. Despite the projection, each query’s neighbors form a cluster of similar expressions, even though they belong to different identities. This indicates that the expression bank contains meaningful cross-identity matches, which enables reliable nearest-neighbor retrieval for expression substitution during training.
  • Figure 3: Human preference for the most similar match to a query face, reported separately for head pose and facial expression. The candidates are nearest neighbors (NN) retrieved from a large expression bank and from a small one, which is built from the same set of identities but with fewer frames.
  • Figure 4: An overview of our method. During training, we randomly replace 50% of each subject’s expression features with those of their nearest neighbors originating from different identities. This retrieval-based substitution encourages the model to generalize expressions across identities and disentangle expression from appearance. The avatar network (MLP) is subsequently trained to reconstruct the original frame conditioned on the substituted expression features, optimized under a reconstruction objective.
  • Figure 5: Self-driving qualitative comparison. Compared to the baselines, RAF produces unseen expressions that more closely resemble those of the subject.
  • ...and 1 more figure
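
The retrieval-based substitution summarized in the Figure 4 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Euclidean distance, the brute-force search, and the per-frame Bernoulli draw with probability 0.5 (standing in for "replace 50%") are all assumptions made for illustration, and the expression bank here is random data rather than real expression features.

```python
import numpy as np


def build_cross_identity_neighbors(features, identity_ids):
    """For each bank entry, return the index of its nearest entry
    that belongs to a different identity (brute force, Euclidean)."""
    sq_norms = np.sum(features ** 2, axis=1)
    # Pairwise squared Euclidean distances between all bank entries.
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    # Exclude matches from the same identity (this also excludes self-matches).
    same_identity = identity_ids[:, None] == identity_ids[None, :]
    d2 = np.where(same_identity, np.inf, d2)
    return np.argmin(d2, axis=1)


def substitute_expressions(frame_idx, features, neighbors, rng, p=0.5):
    """With probability p (assumed value), swap a frame's expression feature
    for that of its cross-identity nearest neighbor."""
    swap = rng.random(len(frame_idx)) < p
    source_idx = np.where(swap, neighbors[frame_idx], frame_idx)
    return features[source_idx], swap


# Toy expression bank: 4 identities, 50 frames each, 64-D features (all hypothetical).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))
identity_ids = np.repeat(np.arange(4), 50)
neighbors = build_cross_identity_neighbors(features, identity_ids)

# Frames of the subject being trained (identity 0 in this toy setup).
batch_idx = np.arange(50)
cond, swap = substitute_expressions(batch_idx, features, neighbors, rng)
# `cond` would condition the avatar network; the reconstruction target
# remains the subject's original frame at `batch_idx`.
```

Only the conditioning input changes in this sketch. The supervision signal is still the subject's own frame, which is what lets the substitution work without paired cross-identity data.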