Audio-driven High-resolution Seamless Talking Head Video Editing via StyleGAN

Jiacheng Su, Kunhong Liu, Liyan Chen, Junfeng Yao, Qingsong Liu, Dongdong Lv

TL;DR

This work tackles high-resolution audio-driven talking head video editing by introducing a two-module framework that combines an Audio-to-Landmark (AL) module and a Landmark-based Editing (LE) module. The AL module uses Cross-Reconstructed Emotion Disentanglement to separate emotion from content and a Cross-Attention alignment network to produce pose-aligned facial landmarks from audio, enabling emotion-aware guidance. The LE module inverts frames into StyleGAN's $\mathcal{W}^+$ latent space, optimizes latent codes with a multi-term loss (e.g., $L_{LPIPS}$, $L_2$, $L_{smooth}$), and uses stitching tuning to produce seamless, high-fidelity edits, preserving identity while ensuring temporal coherence. Experiments on MEAD and HDTF show superior high-resolution quality and lip-sync accuracy compared to state-of-the-art methods, highlighting practical potential for high-fidelity digital humans with controllable expressions.
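
As a rough illustration of the multi-term loss named above, one common way to combine such terms is a weighted sum. This is only a sketch: the weights $\lambda_1, \lambda_2, \lambda_3$ are hypothetical symbols that do not appear in the summary, and the trailing dots stand for any further terms the paper may use beyond the three listed as examples.

```latex
L_{total} = \lambda_{1} L_{LPIPS} + \lambda_{2} L_{2} + \lambda_{3} L_{smooth} + \cdots
```

Under this reading, $L_{LPIPS}$ and $L_2$ would keep each edited frame close to its source perceptually and pixel-wise, while $L_{smooth}$ would penalize abrupt changes between neighboring frames.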

Abstract

Existing methods for audio-driven talking head video editing suffer from poor visual quality. This paper tackles the problem by seamlessly editing talking face images with different emotions using two modules: (1) an audio-to-landmark module, consisting of Cross-Reconstructed Emotion Disentanglement and an alignment network, which bridges the gap between speech and facial motion by predicting the corresponding emotional landmarks from speech; and (2) a landmark-based editing module, which edits face videos via StyleGAN and aims to generate a seamless edited video that reflects the emotion and content components of the input audio. Extensive experiments confirm that, compared with state-of-the-art methods, our method produces high-resolution videos with high visual quality.

Paper Structure

This paper contains 12 sections, 8 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Our method fits the generated frame to the landmark predicted from the given audio.
  • Figure 2: The framework of our method. Our method is divided into two parts: (1) Audio-to-Landmark Module; (2) Landmark-based Editing Module, which contains three steps: a) Inversion, b) Optimization, c) Stitching Tuning.
  • Figure 3: Qualitative comparisons with state-of-the-art methods: three examples with different speech content from the HDTF dataset, compared against Wav2Lip, VideoReTalking, and StyleHEAT.
  • Figure 4: More results on the MEAD dataset, compared with Wav2Lip and VideoReTalking.
  • Figure 5: Emotional editing. We show the frames generated with linear emotional variation.
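
The linear emotional variation mentioned for Figure 5 can be pictured as a straight-line blend between two emotion representations. The sketch below is a minimal illustration, assuming each emotion is a plain vector of numbers. The function name `interpolate_emotion`, the two-dimensional codes, and the step count are hypothetical and not taken from the paper, which may vary emotion in a different space.

```python
def interpolate_emotion(start, end, steps):
    """Return `steps` vectors moving linearly from `start` to `end`.

    Hypothetical helper: `start` and `end` stand in for emotion codes
    (for example neutral and happy); the paper's actual representation
    is not given in this summary.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    if len(start) != len(end):
        raise ValueError("start and end must have the same length")

    codes = []
    for i in range(steps):
        # alpha runs from 0.0 (pure start) to 1.0 (pure end)
        alpha = i / (steps - 1)
        codes.append([(1 - alpha) * s + alpha * e for s, e in zip(start, end)])
    return codes


# Toy two-dimensional codes, chosen only for illustration
codes = interpolate_emotion([0.0, 0.0], [1.0, 2.0], steps=5)
for code in codes:
    print(code)
```

Each intermediate code would then condition one generated frame, so the expression shifts gradually rather than jumping between the two emotions.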