FLAP: Fully-controllable Audio-driven Portrait Video Generation through 3D head conditioned diffusion model

Lingzhou Mu, Baiji Liu, Ruonan Zhang, Guiming Mo, Jiawei Jin, Kai Zhang, Haozhi Huang

TL;DR

FLAP addresses the limited controllability of diffusion-based portrait generation by conditioning a diffusion network on explicit 3D head coefficients derived from the FLAME model. It introduces a 3D head coefficient conditioning mechanism, an audio-to-FLAME module that ties lip-sync and expressions to the audio, and a Progressively Focused Training scheme that decouples head pose from facial expressions. The approach produces natural-looking video with precise 6DoF head motion control and independent expression manipulation, and it compares favorably with several baselines and landmark-based methods across diverse datasets. FLAP is also flexible: it is compatible with existing FLAME-based head generation methods and can be retrained with alternative coefficient or feature-vector conditions. This makes it a practical tool for filmmaking, live streaming, and other real-world applications that require controllable, high-quality talking-head video.

Abstract

Diffusion-based video generation techniques have significantly improved zero-shot talking-head avatar generation, enhancing the naturalness of both head motion and facial expressions. However, existing methods suffer from poor controllability, making them less applicable to real-world scenarios such as filmmaking and live streaming for e-commerce. To address this limitation, we propose FLAP, a novel approach that integrates explicit 3D intermediate parameters (head poses and facial expressions) into the diffusion model for end-to-end generation of realistic portrait videos. The proposed architecture allows the model to generate vivid portrait videos from audio while simultaneously incorporating additional control signals, such as head rotation angles and eye-blinking frequency. Furthermore, the decoupling of head pose and facial expression allows for independent control of each, offering precise manipulation of both the avatar's pose and facial expressions. We also demonstrate its flexibility in integrating with existing 3D head generation methods, bridging the gap between 3D model-based approaches and end-to-end diffusion techniques. Extensive experiments show that our method outperforms recent audio-driven portrait video models in both naturalness and controllability.
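The idea of adding a user control signal on top of audio-driven parameters, while leaving expressions untouched, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `HeadFrame` layout, the `bias_rotation` helper, the use of degrees, and the two-element expression vectors are hypothetical and do not reproduce the paper's data format or implementation.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class HeadFrame:
    """Hypothetical per-frame 3D head parameters (not the paper's layout)."""
    expression: Tuple[float, ...]          # expression coefficients, e.g. predicted from audio
    rotation: Tuple[float, float, float]   # (yaw, pitch, roll), assumed to be in degrees


def bias_rotation(
    frames: List[HeadFrame],
    yaw_offset: float = 0.0,
    pitch_offset: float = 0.0,
    roll_offset: float = 0.0,
) -> List[HeadFrame]:
    """Add a constant user-chosen offset to the head rotation of every frame.

    Expression coefficients are carried over unchanged, which mirrors the
    independent pose/expression control described in the abstract.
    """
    biased = []
    for frame in frames:
        yaw, pitch, roll = frame.rotation
        new_rotation = (yaw + yaw_offset, pitch + pitch_offset, roll + roll_offset)
        biased.append(replace(frame, rotation=new_rotation))
    return biased


# Two frames standing in for audio-driven output (values are made up).
audio_driven = [
    HeadFrame(expression=(0.1, 0.3), rotation=(0.0, 2.0, 0.0)),
    HeadFrame(expression=(0.4, 0.2), rotation=(1.0, 2.5, 0.0)),
]

# Turn the head 15 degrees in yaw across the whole clip.
turned = bias_rotation(audio_driven, yaw_offset=15.0)
```

In this sketch only the rotation part of the condition changes, so the lip and expression coefficients passed to the generator would stay exactly as the audio produced them. Translation, the other half of 6DoF control, is omitted for brevity.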

Paper Structure

This paper contains 34 sections, 4 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: Additional Results on Multi-pose Generation
  • Figure 2: Model architecture of FLAP. The main diffusion net accepts generated 3D parameters as input, which can also be modified and biased by user input. For audio-only scenarios, FLAP uses an audio-to-FLAME module, which generates FLAME coefficients from audio. Our main U-Net block consists of four layers: a head motion layer, a spatial attention layer, an expression layer, and a temporal layer. These layers are trained in different stages with different input conditions under our proposed Progressively Focused Training scheme, which is discussed in the ablation section.
  • Figure 2: Visual results of integrating the talking style from SadTalker [zhang2023sadtalker] into Audio-DVP [wen2020photorealistic] and combining it with FLAP. The first row shows the original FLAP, while the second row shows the result of using Audio-DVP with style control in place of the audio-to-FLAME module.
  • Figure 3: Qualitative comparisons. Our model achieves the most accurate pose control and the best visual quality and lip synchronization.
  • Figure 3: Visual results of the FLAP framework combined with PD-FGC [wang2023progressive], obtained by retraining FLAP with feature vectors from PD-FGC as conditions. We align the pose of the driving image using the pose vector from PD-FGC.
  • ...and 9 more figures