FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision

Tobias Kirschstein, Simon Giebenhain, Matthias Nießner

TL;DR

FlexAvatar tackles the problem of creating complete, animatable 3D head avatars from a single image by addressing the entanglement between driving signal and target viewpoint in monocular training. It introduces learnable bias sinks that separate monocular and multi-view data influence, enabling unified training while yielding complete 3D reconstructions at inference time. The architecture combines a transformer-based encoder $E$, a decoder $D$ that outputs articulated 3D Gaussians, and a StyleGAN-PixelShuffle upsampler, trained on diverse datasets to produce a smooth latent avatar space that supports identity interpolation and fast fitting. Across 3D portrait animation, single-image, few-shot, and monocular avatar creation tasks, FlexAvatar demonstrates strong generalization and render quality, with fast adaptation and minimal data requirements for high-fidelity avatars.

Abstract

We introduce FlexAvatar, a method for creating high-quality and complete 3D head avatars from a single image. A core challenge lies in the limited availability of multi-view data and the tendency of monocular training to yield incomplete 3D head reconstructions. We identify the root cause of this issue as the entanglement between driving signal and target viewpoint when learning from monocular videos. To address this, we propose a transformer-based 3D portrait animation model with learnable data source tokens, so-called bias sinks, which enables unified training across monocular and multi-view datasets. This design leverages the strengths of both data sources during inference: strong generalization from monocular data and full 3D completeness from multi-view supervision. Furthermore, our training procedure yields a smooth latent avatar space that facilitates identity interpolation and flexible fitting to an arbitrary number of input observations. In extensive evaluations on single-view, few-shot, and monocular avatar creation tasks, we verify the efficacy of FlexAvatar. Many existing methods struggle with view extrapolation while FlexAvatar generates complete 3D head avatars with realistic facial animations. Website: https://tobias-kirschstein.github.io/flexavatar/

Paper Structure

This paper contains 36 sections, 17 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: FlexAvatar. From just a single portrait image of a person, FlexAvatar creates a high quality 3D head avatar representation that can be freely animated and rendered from diverse viewpoints. Our model can be flexibly applied to other scenarios including creating avatars from a phone scan or from monocular videos. The entire avatar creation process can be executed within minutes.
  • Figure 2: Method Overview of FlexAvatar. Given the single input image $I$, our method allows changing both the viewpoint $\pi$ and the facial expression $z_{exp}$. The transformer-based encoder $E$ first produces a compressed avatar code $\mathcal{A}$ via cross-attention. The decoder $D$ then incorporates the effect of the facial expression $z_{exp}$ into the avatar representation. Crucially, the corresponding bias sinks are concatenated to the expression tokens: $z_{2D}$ if the input image $I$ comes from a monocular dataset, and $z_{3D}$ if it comes from a multi-view dataset. Finally, the upsampled avatar code is decoded into the 3D Gaussian attributes for rendering. During training, the bias sinks absorb data modality-specific biases such as the entanglement of driver expression and target viewpoint in monocular datasets. At inference time, only $z_{3D}$ is used to inherit the disentangled behavior of multi-view datasets, yielding both generalized and complete 3D head avatars.
  • Figure 3: Architecture of the StyleGAN-PixelShuffle block.
  • Figure 4: Entanglement of driving signal and target viewpoint. Naive training on monocular data works well as long as both expression code $z_{drive}$ and rendering camera $\pi_{target}$ are transferred to the avatar ($\pi_{target} = \pi_{drive}$). Artifacts occur when the rendering camera is moved, i.e., rendering and driving viewpoint differ ($\pi_{target} \neq \pi_{drive}$). This issue is fixed by our proposed bias sinks.
  • Figure 5: Qualitative Single-image Avatar Creation comparison on the Ava256 dataset. We compare our method to the recent state-of-the-art on 3D head avatar creation from a single portrait image. Our method produces more complete 3D head avatars and re-enacts the target expression more faithfully.
  • ...and 7 more figures
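
The bias-sink selection described in the Figure 2 caption can be illustrated with a small sketch. It is not the authors' implementation: the token width, the number of expression tokens, the random stand-ins for the learned tokens $z_{2D}$ and $z_{3D}$, and the helper name `build_condition_tokens` are all assumptions made for illustration. The sketch only shows the routing logic: during training the sink matching the data source is concatenated to the expression tokens, and at inference $z_{3D}$ is always used.

```python
import numpy as np

TOKEN_DIM = 8  # hypothetical token width, not taken from the paper

rng = np.random.default_rng(0)

# Random stand-ins for the two learnable bias sinks (z_2D and z_3D).
# In a real model these would be trainable parameters.
bias_sinks = {
    "2d": rng.normal(size=(1, TOKEN_DIM)),  # monocular datasets
    "3d": rng.normal(size=(1, TOKEN_DIM)),  # multi-view datasets
}


def build_condition_tokens(expr_tokens, source, training):
    """Concatenate the appropriate bias sink to the expression tokens.

    expr_tokens: array of shape (num_tokens, TOKEN_DIM)
    source:      "2d" for monocular data, "3d" for multi-view data
    training:    if False, the 3D sink is used regardless of `source`
    """
    if source not in bias_sinks:
        raise ValueError(f"unknown data source: {source!r}")
    key = source if training else "3d"
    # Append the sink as one extra token along the token axis.
    return np.concatenate([expr_tokens, bias_sinks[key]], axis=0)


# Four hypothetical expression tokens derived from z_exp.
expr_tokens = rng.normal(size=(4, TOKEN_DIM))

train_mono = build_condition_tokens(expr_tokens, "2d", training=True)
train_multi = build_condition_tokens(expr_tokens, "3d", training=True)
inference = build_condition_tokens(expr_tokens, "2d", training=False)
```

In this sketch, `inference` ends with the $z_{3D}$ token even though the input was labeled as monocular, which mirrors the caption's statement that only $z_{3D}$ is used at inference time.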