Takin-ADA: Emotion Controllable Audio-Driven Animation with Canonical and Landmark Loss Optimization

Bin Lin, Yanzhen Yu, Jianhao Ye, Ruitao Lv, Yuguang Yang, Ruoye Xie, Pan Yu, Hongbin Zhou

TL;DR

Takin-ADA is a two-stage approach to real-time audio-driven portrait animation. The first stage enhances subtle expression transfer while reducing unwanted expression leakage, and the second stage uses an advanced audio processing technique to improve lip-sync accuracy.

Abstract

Existing audio-driven facial animation methods face critical challenges, including expression leakage, ineffective subtle expression transfer, and imprecise audio-driven synchronization. We discovered that these issues stem from limitations in motion representation and the lack of fine-grained control over facial expressions. To address these problems, we present Takin-ADA, a novel two-stage approach for real-time audio-driven portrait animation. In the first stage, we introduce a specialized loss function that enhances subtle expression transfer while reducing unwanted expression leakage. The second stage utilizes an advanced audio processing technique to improve lip-sync accuracy. Our method not only generates precise lip movements but also allows flexible control over facial expressions and head motions. Takin-ADA achieves high-resolution (512x512) facial animations at up to 42 FPS on an RTX 4090 GPU, outperforming existing commercial solutions. Extensive experiments demonstrate that our model significantly surpasses previous methods in video quality, facial dynamics realism, and natural head movements, setting a new benchmark in the field of audio-driven facial animation.

Paper Structure

This paper contains 17 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: We introduce Takin-ADA, a framework that transforms input audio and a single static portrait into animated talking videos with naturally flowing movements. Each column of generated results uses identical control signals with different expressions but incorporates some random variations, demonstrating the diversity of our generated outcomes.
  • Figure 2: Illustration of our proposed Takin-ADA. The framework comprises two primary components: (1) a representation learning module for extracting expressive and disentangled facial latent representations, and (2) a sequence generation module that synthesizes motion sequences based on audio input. The first component learns robust motion representations using a canonical keypoint loss and landmark guidance. These learned motion representations then serve as input to the second component, enabling audio-driven facial image generation and manipulation.
  • Figure 3: Qualitative comparisons of Cross-reenactment. This task involves transferring actions from a source portrait to a target portrait to evaluate each algorithm's ability to separate motion and appearance. The results highlight our method's superior ability in both motion transfer and appearance retention, while also excelling in the transfer of subtle micro-expressions and extreme facial expressions.
  • Figure 4: Visual comparison of the speech-driven method. Phonetic sounds are highlighted in red.
  • Figure 5: Generated results under different emotion offsets (happy, surprised, sad, angry, and disgusted, respectively).