Joker: Conditional 3D Head Synthesis with Extreme Facial Expressions

Malte Prinzler, Egor Zakharov, Vanessa Sklyarova, Berna Kabadayi, Justus Thies

TL;DR

Joker is a new method for the conditional synthesis of 3D human heads with extreme expressions and, to the authors' knowledge, the first to achieve view-consistent extreme tongue articulation.

Abstract

We introduce Joker, a new method for the conditional synthesis of 3D human heads with extreme expressions. Given a single reference image of a person, we synthesize a volumetric human head with the reference identity and a new expression. We offer control over the expression via a 3D morphable model (3DMM) and textual inputs. This multi-modal conditioning signal is essential since 3DMMs alone fail to define subtle emotional changes and extreme expressions, including those involving the mouth cavity and tongue articulation. Our method is built upon a 2D diffusion-based prior that generalizes well to out-of-domain samples, such as sculptures, heavy makeup, and paintings, while achieving high levels of expressiveness. To improve view consistency, we propose a new 3D distillation technique that converts predictions of our 2D prior into a neural radiance field (NeRF). Both the 2D prior and our distillation technique produce state-of-the-art results, as confirmed by our extensive evaluations. To the best of our knowledge, our method is also the first to achieve view-consistent extreme tongue articulation.

Paper Structure

This paper contains 31 sections, 1 equation, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Method Overview. We train a 2D diffusion-based prior for novel pose and expression synthesis from a single reference image. It is controlled through text prompts and 3DMM parameters. We leverage this 2D prior to optimize a Neural Radiance Field (NeRF; Mildenhall et al., 2020) with a novel two-stage distillation procedure. During Stage 1, the NeRF is optimized against single-step-denoised predictions of the 2D prior that are recalculated every $N$ optimization iterations. In Stage 2, the target images are calculated once in a multi-step denoising process and kept fixed during the NeRF optimization.
  • Figure 2: 3DMM- and text-guided 3D reconstruction. Through text guidance, our model resolves ambiguities in the 3DMM control signal, can express tongue articulation, and provides fine-grained emotion control. Note that the 3DMM input is kept fixed for both 3D reconstructions in each row; only the text prompt changes.
  • Figure 3: Out-of-distribution 3D reconstruction examples.
  • Figure 4: Comparison of our 2D diffusion prior for self- and cross-reenactment (rows 1-2 and 3-4, respectively).
  • Figure 5: Comparison of 3D reconstructions from different distillation procedures. *: Replacing the method's 2D prior with our model for fair comparison.
  • ...and 8 more figures
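
The two-stage schedule described in the Figure 1 caption can be illustrated with a toy sketch. This is not the authors' implementation: the "NeRF" is reduced to a parameter vector whose render is the vector itself, and `single_step_denoise`, `multi_step_denoise`, `l2_step`, and `distill` are hypothetical stand-ins invented for illustration. Only the scheduling logic mirrors the caption: Stage 1 refreshes single-step targets every N iterations, and Stage 2 computes multi-step targets once and keeps them fixed.

```python
import random


def single_step_denoise(render, reference, rng, noise=0.1):
    """Hypothetical stand-in for a one-step prediction of the 2D prior:
    noise the current render, then pull it halfway toward the reference."""
    noised = [r + rng.gauss(0.0, noise) for r in render]
    return [0.5 * n + 0.5 * ref for n, ref in zip(noised, reference)]


def multi_step_denoise(render, reference, rng, steps=10):
    """Hypothetical stand-in for multi-step denoising: repeated refinement."""
    x = list(render)
    for _ in range(steps):
        x = single_step_denoise(x, reference, rng)
    return x


def l2_step(params, target, lr):
    """One gradient-descent step on 0.5 * ||params - target||^2."""
    return [p - lr * (p - t) for p, t in zip(params, target)]


def distill(reference, stage1_iters=200, stage2_iters=200,
            refresh_every=10, lr=0.1, seed=0):
    rng = random.Random(seed)
    params = [0.0] * len(reference)  # toy "NeRF": render(params) == params
    log = {"stage1_refreshes": 0, "stage2_refreshes": 0}

    # Stage 1: single-step targets, recalculated every `refresh_every` iterations.
    target = None
    for it in range(stage1_iters):
        if it % refresh_every == 0:
            target = single_step_denoise(params, reference, rng)
            log["stage1_refreshes"] += 1
        params = l2_step(params, target, lr)

    # Stage 2: multi-step targets, calculated once and kept fixed.
    target = multi_step_denoise(params, reference, rng)
    log["stage2_refreshes"] += 1
    for _ in range(stage2_iters):
        params = l2_step(params, target, lr)

    return params, log


if __name__ == "__main__":
    final_params, stats = distill([1.0, -2.0, 0.5])
    print(stats)
```

In a real setting the targets would be images rendered from sampled camera poses and the update would be a photometric loss on NeRF renders; the defaults above (iteration counts, learning rate, refresh interval) are arbitrary assumptions, not values from the paper.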