Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait Animation

Taekyung Ki; Dongchan Min; Gyeongsu Chae

TL;DR

This work tackles 3D-aware portrait animation with cross-identity expression transfer by addressing the entanglement of appearance and expression. It introduces Export3D, which uses Contrastive Learned Basis Scaling (CLeBS) to extract appearance-free expressions and a Hybrid Tri-plane Generator with Expression Adaptive Layer Normalization (EAdaLN) to inject driving expressions into a 3D-aware tri-plane, followed by differentiable volume rendering and super-resolution. The key contributions are the CLeBS framework for appearance-free expression, the end-to-end tri-plane-based generator, and experiments showing reduced appearance swap and improved 3D-view consistency in both same-identity and cross-identity settings. The approach enables one-shot, high-fidelity, 3D-aware portrait animation driven by 3DMM parameters, with applications in realistic avatar animation and video synthesis. The acknowledged limitations are background separation and eye gaze control.
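
The conditioning idea behind EAdaLN can be illustrated with a generic adaptive layer normalization. The summary does not give the exact formulation, so the sketch below is an assumption: it normalizes a feature vector and then applies a scale and shift predicted from a condition vector by two hypothetical linear maps. The names `adaptive_layer_norm`, `w_scale`, `w_shift`, and the `1 + scale` parameterization are illustrative and are not taken from the paper.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weights, bias, vec):
    """Affine map; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def adaptive_layer_norm(x, cond, w_scale, b_scale, w_shift, b_shift):
    """Normalize x, then modulate it with values predicted from cond.

    Assumed form: y = (1 + scale(cond)) * LN(x) + shift(cond).
    """
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    normed = layer_norm(x)
    return [(1.0 + s) * n + t for s, n, t in zip(scale, normed, shift)]

# Toy sizes: 4 feature channels, 3-dimensional condition vector.
features = [1.0, 2.0, 3.0, 4.0]
expression = [0.5, -1.0, 0.25]
zeros_w = [[0.0] * 3 for _ in range(4)]
zeros_b = [0.0] * 4

# With zero-initialized projections the layer reduces to plain layer norm.
out = adaptive_layer_norm(features, expression,
                          zeros_w, zeros_b, zeros_w, zeros_b)
```

In this form the condition vector only changes per-channel statistics of the normalized features, which is one common way to inject a low-dimensional control signal without warping the input.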

Abstract

In this paper, we present Export3D, a one-shot 3D-aware portrait animation method that can control the facial expression and camera view of a given portrait image. To achieve this, we introduce a tri-plane generator with an effective expression conditioning method, which directly generates a tri-plane as a 3D prior by transferring the 3DMM expression parameter to the source image. The tri-plane is then decoded into an image from a different view through differentiable volume rendering. Existing portrait animation methods rely heavily on image warping to transfer the expression in the motion space, which makes it difficult to disentangle appearance and expression. In contrast, we propose a contrastive pre-training framework for an appearance-free expression parameter, eliminating undesirable appearance swap when transferring a cross-identity expression. Extensive experiments show that our pre-training framework can learn the appearance-free expression representation hidden in 3DMM, and that our model can generate 3D-aware, expression-controllable portrait images without appearance swap in the cross-identity setting.
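
As a rough illustration of the differentiable volume rendering step, the sketch below composites density and color samples along a single ray using the standard front-to-back quadrature from the NeRF literature. It is an assumption that the rendering here follows this common form; the function name `composite_ray` and its arguments are hypothetical, and the sampling of the tri-plane that would produce the densities and colors is omitted.

```python
import math

def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray, front to back.

    densities: non-negative density sigma_i for each sample
    colors:    (r, g, b) tuple for each sample
    deltas:    spacing between consecutive samples
    Returns the accumulated RGB color and the total opacity of the ray.
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    rgb = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution to the pixel
        rgb = [acc + weight * c for acc, c in zip(rgb, color)]
        transmittance *= 1.0 - alpha
    return rgb, 1.0 - transmittance

# Three samples on a ray: empty space, then a red and a green surface.
pixel, opacity = composite_ray(
    densities=[0.0, 2.0, 5.0],
    colors=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    deltas=[0.1, 0.1, 0.1],
)
```

Every operation in the loop is smooth in the densities and colors, which is what allows gradients from an image loss to flow back to whatever network produced them.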

Paper Structure (26 sections, 14 equations, 18 figures, 5 tables)

Introduction
Related Works
3D-aware Image Synthesis
Portrait Image Animation
Methods
Contrastive Learned Basis Scaling (CLeBS)
Hybrid Tri-plane Generator
Volume Rendering and Super-resolution
Experiments
Dataset and Pre-processing
Evaluation
Ablation Studies and Further Results
Conclusion
Supplementary Material
3D Morphable Models (3DMM)
...and 11 more sections

Figures (18)

Figure 1: Training overview of Export3D. We convert a source image $S\in\mathbb{R}^{3\times H \times W}$ into a tri-plane $T_{\beta_D}(S)$ for rich 3D priors, conditioned on an expression parameter $\beta_{D}\in \mathbb{R}^{64}$ from a driving image $D \in \mathbb{R}^{3\times H \times W}$. Differentiable volume rendering turns the tri-plane into a raw image $\hat{D}_{raw}\in \mathbb{R}^{3\times\frac{H}{4}\times\frac{W}{4}}$ using the camera parameter $p_{D} \in \mathbb{R}^{25}$ of $D$. The raw image is then super-resolved into the final image $\hat{D} \in \mathbb{R}^{3\times H \times W}$.
Table 1: Quantitative comparison on VFHQ. The best score for each metric is in bold. Note that we only measure CSIM [arcface], AED, and APD [bfmpirenderer] for the cross-identity experiment, as no ground truth is available. $^\dagger$: Evaluated only on the foreground facial region.
Figure 2: Contrastive pre-training framework for LeBS. We sample the positive and negative samples from the same video source so that they share the same appearance. Using contrastive learning, the encoder $f_{e}(\cdot)$ learns an appearance-free representation.
Table 2: Quantitative comparison on TalkingHead-1KH. The best score for each metric is in bold. Note that we only measure CSIM [arcface], AED, and APD [bfmpirenderer] for the cross-identity experiment, as no ground truth is available. $^\dagger$: Evaluated only on the foreground facial region.
Figure 3: Hybrid tri-plane generator $\mathbf{G}$ and Expression Adaptive Layer Normalization (EAdaLN). EAdaLN modulates the expression of $S$ using the refined expression $\beta'$ from CLeBS.
...and 13 more figures
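
The contrastive pre-training described in the Figure 2 caption can be sketched with an InfoNCE-style loss. The captions do not name the loss or say how positives are chosen, so both are assumptions here: InfoNCE is used as a common stand-in, and the embeddings below are toy vectors standing in for outputs of the encoder $f_{e}(\cdot)$. The function names and the temperature value are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax of the positive."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    top = max(logits)  # subtract the maximum for numerical stability
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[0]

# Toy embeddings. Following the caption, the positive and the negatives
# are imagined to come from the same video, so appearance is shared and
# only expression can tell them apart.
anchor = [1.0, 0.0]
loss_aligned = info_nce(anchor, positive=[1.0, 0.0], negatives=[[0.0, 1.0]])
loss_swapped = info_nce(anchor, positive=[0.0, 1.0], negatives=[[1.0, 0.0]])
```

Because every candidate shares the anchor's appearance, an encoder trained this way gains nothing from appearance cues, which is the intuition behind calling the learned representation appearance-free.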
