UniAvatar: Taming Lifelike Audio-Driven Talking Head Generation with Comprehensive Motion and Lighting Control

Wenzhang Sun, Xiang Li, Donglin Di, Zhuding Liang, Qiyuan Zhang, Hao Li, Wei Chen, Jianxun Cui

TL;DR

UniAvatar tackles lifelike talking-head generation by enabling simultaneous, modular control over 3D motion and global illumination. It fuses FLAME-based 3D priors with a diffusion-based generator, introducing Motion-aware Rendering and Illumination-aware Rendering, plus Masked-Cross-Source Sampling to keep backgrounds stable under varied lighting. The framework uses separate encoders for motion and lighting and injects their guidance into a cross-attention-enabled denoising network, achieving pixel-level motion control and flexible relighting. The authors plan to publicly release two new datasets, DH-FaceDrasMvVid-100 and DH-FaceReliVid-200, to broaden motion and lighting diversity, and they report that UniAvatar outperforms other methods in both broad-range motion control and lighting control.

Abstract

Animating portrait images from audio input has recently become a popular task. Creating lifelike talking head videos requires flexible and natural movements, including facial and head dynamics, camera motion, and realistic light and shadow effects. Existing methods struggle to offer comprehensive, multifaceted control over these aspects. In this work, we introduce UniAvatar, a method designed to provide extensive control over a wide range of motion and illumination conditions. Specifically, we use the FLAME model to render all motion information onto a single image, maintaining the integrity of 3D motion details while enabling fine-grained, pixel-level control. Beyond motion, this approach also allows for comprehensive global illumination control. We design independent modules to manage 3D motion and illumination, permitting both separate and combined control. Extensive experiments demonstrate that our method outperforms others in both broad-range motion control and lighting control. Additionally, to enhance the diversity of motion and environmental contexts in current datasets, we collect and plan to publicly release two datasets, DH-FaceDrasMvVid-100 and DH-FaceReliVid-200, which capture significant head movements during speech and various lighting scenarios.

Paper Structure

This paper contains 15 sections, 6 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Showcases under various control signals. Our method enables different motion controls without failure during extensive movements, and allows flexible generation under different lighting conditions.
  • Figure 2: Showcase of publicly available datasets and our proposed datasets: We refer to datasets like HDTF and DH-FaceVid-1k as normal datasets, which contain a wide range of identity information. In contrast, our datasets offer more extensive motion variations under the same identity (DH-FaceDrasMvVid-100) and more diverse lighting conditions under the same identity (DH-FaceReliVid-200).
  • Figure 3: The overall framework of UniAvatar. We use a Masked-Cross-Source Sampling strategy to learn the lighting information and ensure background stability. To enable independent and combined control over different conditions, we use a separate renderer for each condition and dedicated modules for injecting motion and illumination conditions.
  • Figure 4: Visualization of the sampling strategy. We sample from different source videos under the same identity. To ensure background stability, we build a database of 500 background images and randomly select and composite new images.
  • Figure 5: Visual comparisons with different methods. Results demonstrate that UniAvatar surpasses other methods across multiple control modalities. UniAvatar maintains stability even during large movements and provides flexible global illumination control.
  • ...and 6 more figures
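
The sampling strategy described in the Figure 4 caption (sampling from different source videos of the same identity, then compositing onto a randomly selected background image) can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names, the array layouts, the foreground mask, and the roles of "reference" and "target" frames are all hypothetical, and the summary above does not say how the mask is obtained or how the sampled frames are used in training.

```python
import numpy as np


def sample_cross_source_pair(videos_by_identity, identity, rng):
    """Pick one frame from each of two different source videos of one identity.

    videos_by_identity: dict mapping an identity key to a list of videos,
        each video an array of shape (T, H, W, 3). Hypothetical layout.
    Requires at least two videos for the chosen identity.
    """
    videos = videos_by_identity[identity]
    # Draw two distinct video indices so the frames come from different sources.
    ref_vid, tgt_vid = rng.choice(len(videos), size=2, replace=False)
    ref = videos[ref_vid][rng.integers(len(videos[ref_vid]))]
    tgt = videos[tgt_vid][rng.integers(len(videos[tgt_vid]))]
    return ref, tgt, (int(ref_vid), int(tgt_vid))


def composite_with_random_background(frame, fg_mask, backgrounds, rng):
    """Paste the masked foreground of `frame` onto a randomly chosen background.

    frame:       (H, W, 3) float array in [0, 1]
    fg_mask:     (H, W) float array in [0, 1], 1 = foreground (assumed given)
    backgrounds: (N, H, W, 3) float array, a hypothetical background bank
                 (the caption mentions a database of 500 images)
    """
    idx = int(rng.integers(len(backgrounds)))
    alpha = fg_mask[..., None]  # broadcast the mask over the RGB channels
    out = alpha * frame + (1.0 - alpha) * backgrounds[idx]
    return out, idx
```

One plausible use is to composite both sampled frames onto the same background index, so that a background difference between the two sources is not mistaken for a lighting difference. That pairing choice is likewise an assumption here.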