Exploiting Temporal Audio-Visual Correlation Embedding for Audio-Driven One-Shot Talking Head Animation

Zhihua Xu, Tianshui Chen, Zhijing Yang, Siyuan Peng, Keze Wang, Liang Lin

TL;DR

This work proposes TAVCE, a framework that first learns an audio-visual temporal correlation metric, which aligns the temporal relationships of adjacent audio clips with those of the corresponding adjacent video frames, and then uses these correlations to enhance feature representation and regularize the final generation.

Abstract

The paramount challenge in audio-driven One-shot Talking Head Animation (ADOS-THA) lies in capturing subtle, nearly imperceptible changes between adjacent video frames. Inherently, the temporal relationship of adjacent audio clips is highly correlated with that of the corresponding adjacent video frames, offering supplementary information that can be pivotal for guiding and supervising talking head animations. In this work, we propose a novel Temporal Audio-Visual Correlation Embedding (TAVCE) framework that learns audio-visual correlations and integrates them to enhance feature representation and regularize the final generation. Specifically, it first learns an audio-visual temporal correlation metric, ensuring that the temporal audio relationships of adjacent clips are aligned with the temporal visual relationships of the corresponding adjacent video frames. Since the temporal audio relationship contains aligned information about the visual frame, we integrate it to guide the learning of more representative features via a simple yet effective channel attention mechanism. During training, we also use the alignment correlations as an additional objective to supervise the generation of visual frames. We conduct extensive experiments on several publicly available benchmarks (i.e., HDTF, LRW, VoxCeleb1, and VoxCeleb2) to demonstrate its superiority over existing leading algorithms.
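The two uses of the correlation described in the abstract (channel attention on the image feature, and an alignment objective during training) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the learned metric, so the sketch assumes that the temporal relationship is the difference between adjacent feature vectors, that audio and visual features already live in a shared embedding space, and that alignment is measured with cosine distance. All function names and the `proj` matrix are hypothetical.

```python
import math


def temporal_relation(prev_feat, curr_feat):
    # Assumption: the temporal relationship is modelled as the change
    # between adjacent feature vectors; the paper learns this metric.
    return [c - p for p, c in zip(prev_feat, curr_feat)]


def cosine_similarity(u, v, eps=1e-8):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def channel_attention(image_feat, audio_relation, proj):
    # image_feat: C channels, each a flat list of spatial activations.
    # proj: hypothetical C x D matrix mapping the audio relation to one
    # logit per channel; a sigmoid turns each logit into a gate in (0, 1).
    gates = []
    for row in proj:
        logit = sum(w * a for w, a in zip(row, audio_relation))
        gates.append(1.0 / (1.0 + math.exp(-logit)))
    # Rescale every channel by its gate.
    return [[g * x for x in channel] for g, channel in zip(gates, image_feat)]


def alignment_loss(audio_relation, visual_relation):
    # Assumed objective: cosine distance between the audio relation and
    # the visual relation (0 when they point in the same direction).
    return 1.0 - cosine_similarity(audio_relation, visual_relation)


# Toy usage with made-up 2-D embeddings.
audio_rel = temporal_relation([0.1, 0.4], [0.3, 0.1])
visual_rel = temporal_relation([0.2, 0.5], [0.5, 0.1])
enhanced = channel_attention([[1.0, 2.0], [3.0, 4.0]], audio_rel,
                             [[0.0, 0.0], [1.0, -1.0]])
loss = alignment_loss(audio_rel, visual_rel)
```

In this sketch the same audio relation is used twice: once at inference and training time to gate the image feature channels, and once at training time only, as the target that the relation between the previous real frame and the generated frame is pushed toward.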

Paper Structure

This paper contains 23 sections, 6 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: An overall pipeline of the proposed Temporal Audio-Visual Correlation Embedding framework. Given a source image and the driving audio, it first extracts the image feature from the source image and predicts 3D coefficients from the audio. We then compute the temporal relationship between the current and previous audio clips and integrate it with the image feature for enhanced feature representation. Following this, the face renderer generates the final image from the image feature and the mapped 3D coefficients. Moreover, during training, the last real visual frame is used to calculate the temporal visual relationship with the generated image. The visual relationship is constrained to be similar to the audio relationship.
  • Figure 2: Illustration of the correlation-embedded representation learning.
  • Figure 3: Qualitative comparisons of state-of-the-art methods and our TAVCE framework for audio-driven one-shot talking head generation on the HDTF and LRW datasets. Our framework delivers high-quality generations in terms of lip synchronization and overall image quality.
  • Figure 4: Qualitative comparisons on the VoxCeleb1 and VoxCeleb2 datasets. Our framework achieves high-quality talking head animations in terms of both lip synchronization and image quality.
  • Figure 5: Comparison of learned features with and without CERL and their corresponding generated images. The features enhanced by the CERL module place more emphasis on the mouth area and thus retain more detail, which better preserves the mouth animations.
  • ...and 1 more figures