Embedded Representation Learning Network for Animating Styled Video Portrait

Tianyong Wang, Xiangyu Liang, Wangguandong Zheng, Dan Niu, Haifeng Xia, Siyu Xia

TL;DR

ERLNet tackles style-controllable talking head synthesis by introducing a two-stage framework: an Audio Driven FLAME (ADF) module that maps audio and style video into FLAME coefficient sequences using a pair of transformer-based VQ-VAEs, and a Dual-Branch Fusion NeRF (DBF-NeRF) that renders heads and torsos from FLAME inputs with a specialized feature fusion strategy to reduce neck artifacts. The method leverages FLAME coefficients as stable intermediate representations and employs a deform module and perceptual losses to enhance realism. Experiments on LDST, MEAD, and HDTF demonstrate superior image quality and realistic style rendering, with ablations validating the necessity of the DBF-NeRF architecture, contrastive speech loss, and perceptual supervision. The work also contributes a long-duration styled talking head dataset (LDST) and shows practical gains in multimodal, style-aware video portrait generation.
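As a rough illustration of the vector-quantization step that any VQ-VAE relies on, the sketch below snaps each continuous latent vector to its nearest codebook entry. It is a generic nearest-neighbour lookup, not the authors' implementation: the function names, the toy 2-D codebook, and the latent values are all hypothetical, and the transformer encoders and decoders that ERLNet places around this step are omitted.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Replace each latent vector with its nearest codebook entry.

    Returns the chosen codebook indices and the quantized vectors.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for z in latents:
        # Nearest-neighbour search over the codebook (hypothetical toy size).
        idx = min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


# Toy data: three 2-D codebook entries and a three-step latent sequence.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.9, 1.2], [0.1, -0.2], [-0.8, 1.7]]
indices, quantized = quantize(latents, codebook)
print(indices)
```

In a setup like the one summarized above, one such codebook would serve the expression sequence and a second one the head-pose sequence; that pairing is stated in the summary, while everything else in the sketch is an assumption.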

Abstract

Talking head generation has recently attracted considerable attention due to its wide application prospects, especially for digital avatars and 3D animation design. Motivated by this practical demand, several works have explored Neural Radiance Fields (NeRF) to synthesize talking heads. However, these NeRF-based methods face two challenges: (1) difficulty in generating style-controllable talking heads, and (2) displacement artifacts around the neck in rendered images. To overcome these two challenges, we propose a novel generative paradigm, the Embedded Representation Learning Network (ERLNet), with two learning stages. First, the audio-driven FLAME (ADF) module is constructed to produce facial expression and head pose sequences synchronized with the content audio and the style video. Second, given the sequences produced by the ADF, a novel dual-branch fusion NeRF (DBF-NeRF) uses them to render the final images. Extensive empirical studies demonstrate that the combination of these two stages enables our method to render more realistic talking heads than existing algorithms.

Paper Structure (35 sections, 19 equations, 9 figures, 3 tables)


Figures (9)

  • Figure 1: Overview of ERLNet. EMOCA and Wav2Vec2 are employed for extracting FLAME coefficients and audio features. During inference, our approach requires two inputs: a style reference video and a content audio. After passing through the ADF module, a FLAME coefficient sequence matched to the speech is obtained. Subsequently, the DBF-NeRF network takes these FLAME coefficients as input to render the final video.
  • Figure 2: Structure of FLAME latent space. Expression sequence $z_{exp}^{1:t}$ and head pose variation sequence $z_{\Delta pose}^{1:t}$ of length $t$ are inputs. Two individual VQ-VAEs are used to build the expression space and the head pose space. Emotion vector $z_{emo}$ and the head pose of the first frame $z_{pose}^0$ are utilized as conditional inputs. Only decoders and codebooks are frozen and used in subsequent networks.
  • Figure 3: Structure of ADF. With expression sequence $z_{exp}^{1:t}$ and head pose variation sequence $z_{\Delta pose}^{1:t}$ of length $t$ as inputs, two encoders $E_{se}$ and $E_{sp}$ are employed to extract expression style feature $s_{exp}$ and head pose style feature $s_p$. Then, audio feature $z_{audio}^{1:n}$ of length $n$, $s_{exp}$ and $s_p$ are combined. Two decoders $D_{se}$ and $D_{sp}$ map them to the pre-trained codebooks. After that, pre-trained decoders $D_{exp}$ and $D_{pose}$ are employed to generate the final expressions and poses. Emotion vector and the initial pose are utilized as conditional inputs.
  • Figure 4: Overview of DBF-NeRF. Given expression $z_{exp}$, head pose $z_{pose}$, and initial head pose $z_{pose}^0$, HeadNeRF and StaticNeRF generate two feature maps and density maps. Subsequently, the two sets of feature maps and density maps are fused using a density-based approach. Finally, after passing through a series of CNN upsampling layers, the ultimate high-resolution image is obtained.
  • Figure 5: Qualitative Evaluation. We selected two style reference videos from the LDST and MEAD datasets and chose two subjects for our experiments. The first row corresponds to the style reference video, while the third row corresponds to the ground truth of mouth movements. The penultimate row shows the results obtained by DBF-NeRF, while the last row shows the results generated by the complete ERLNet.
  • ...and 4 more figures
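
The Figure 4 caption says the HeadNeRF and StaticNeRF outputs are fused "using a density-based approach" but does not give the formula. The sketch below shows one plausible reading, in which each branch's feature is weighted by its share of the total density at a pixel. The weighting rule, the choice to sum the densities, and the name `fuse_pixel` are assumptions made for illustration, not the paper's definition.

```python
from typing import List, Sequence, Tuple


def fuse_pixel(
    feat_head: Sequence[float],
    sigma_head: float,
    feat_static: Sequence[float],
    sigma_static: float,
    eps: float = 1e-8,
) -> Tuple[List[float], float]:
    """Fuse two per-pixel feature vectors, weighting each by its density.

    Assumed rule (not from the paper):
        fused = (sigma_head * feat_head + sigma_static * feat_static)
                / (sigma_head + sigma_static)
    The fused density is taken to be the sum of the two densities.
    """
    total = sigma_head + sigma_static
    if total <= eps:
        # Neither branch sees any density here: treat the pixel as empty.
        return [0.0] * len(feat_head), 0.0
    w_head = sigma_head / total
    w_static = sigma_static / total
    fused = [w_head * h + w_static * s for h, s in zip(feat_head, feat_static)]
    return fused, total


# Toy pixel: the head branch has three times the density of the static branch.
fused_feat, fused_sigma = fuse_pixel([1.0, 0.0], 3.0, [0.0, 1.0], 1.0)
print(fused_feat, fused_sigma)
```

Under this assumed rule, a pixel dominated by one branch takes nearly all of its feature from that branch, which would give a smooth hand-off between the head and static regions; whether DBF-NeRF achieves its reduction in neck artifacts this way is not something the captions confirm.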