FREAK: Frequency-modulated High-fidelity and Real-time Audio-driven Talking Portrait Synthesis

Ziqi Ni, Ao Fu, Yi Zhou

TL;DR

FREAK addresses the challenge of high-fidelity lip synchronization in audio-driven talking portraits by introducing frequency-domain learning. It leverages two novel modules, Visual Encoding Frequency Modulator (VEFM) and Audio Visual Frequency Modulator (AVFM), to modulate visual features and audio-visual interactions in the Fourier domain, while optimizing in both the pixel and frequency spaces with $L_{rec}$, $L_{percep}$, and $L_{freq}$. The approach supports both one-shot generation and video dubbing, delivering real-time, high-resolution results that surpass state-of-the-art methods in fidelity and lip-sync, as demonstrated by comprehensive quantitative, qualitative, and user studies. This frequency-domain perspective offers a new direction for talking-head synthesis with practical impact for digital avatars and media production, while prompting attention to potential misuse and the need for deepfake detection tools.

Abstract

Achieving high-fidelity lip-speech synchronization in audio-driven talking portrait synthesis remains challenging. While multi-stage pipelines or diffusion models yield high-quality results, they suffer from high computational costs. Some approaches perform well on specific individuals with low resources, yet still exhibit mismatched lip movements. All of these methods are modeled in the pixel domain. We observe noticeable discrepancies in the frequency domain between synthesized talking videos and natural videos. Currently, no research on talking portrait synthesis has considered this aspect. To address this, we propose a FREquency-modulated, high-fidelity, and real-time Audio-driven talKing portrait synthesis framework, named FREAK, which models talking portraits from the frequency-domain perspective, enhancing the fidelity and naturalness of the synthesized portraits. FREAK introduces two novel frequency-based modules: 1) the Visual Encoding Frequency Modulator (VEFM), which couples multi-scale visual features in the frequency domain, better preserving visual frequency information and reducing the gap in the frequency spectrum between synthesized and natural frames; and 2) the Audio Visual Frequency Modulator (AVFM), which helps the model learn the talking pattern in the frequency domain and improves audio-visual synchronization. Additionally, we optimize the model jointly in the pixel and frequency domains. Furthermore, FREAK supports seamless switching between one-shot and video dubbing settings, offering enhanced flexibility. Due to its superior performance, it can simultaneously support high-resolution video results and real-time inference. Extensive experiments demonstrate that our method synthesizes high-fidelity talking portraits with detailed facial textures and precise lip synchronization in real time, outperforming state-of-the-art methods.
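As a rough illustration of what optimizing jointly in the pixel and frequency domains can look like, the sketch below combines a pixel-space L1 term with an L1 distance between 2D FFT amplitude spectra. This is an assumption made for illustration, not the paper's definition: the exact forms of the reconstruction and frequency terms are not given here, the perceptual term is omitted because it needs a pretrained network, and the names `pixel_l1`, `frequency_l1`, `joint_loss`, and `w_freq` are hypothetical.

```python
import numpy as np


def pixel_l1(pred, target):
    """Mean absolute error in pixel space (a stand-in for a reconstruction term)."""
    return float(np.mean(np.abs(pred - target)))


def frequency_l1(pred, target):
    """Mean absolute difference between 2D FFT amplitude spectra.

    Assumption: images are arrays of shape (H, W) or (H, W, C), and the
    FFT is taken over the two spatial axes only.
    """
    pred_amp = np.abs(np.fft.fft2(pred, axes=(0, 1), norm="ortho"))
    target_amp = np.abs(np.fft.fft2(target, axes=(0, 1), norm="ortho"))
    return float(np.mean(np.abs(pred_amp - target_amp)))


def joint_loss(pred, target, w_freq=1.0):
    """Weighted sum of the pixel term and the frequency term (w_freq is hypothetical)."""
    return pixel_l1(pred, target) + w_freq * frequency_l1(pred, target)


# Usage: compare a random "ground-truth" frame with a noisy prediction.
rng = np.random.default_rng(0)
target = rng.random((32, 32, 3))
pred = target + 0.05 * rng.standard_normal(target.shape)
loss = joint_loss(pred, target, w_freq=0.5)
```

Because this frequency term compares amplitudes only, it ignores phase and is insensitive to circular shifts of the image, which is one reason such a term would normally be paired with a pixel-space loss rather than used alone.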

Paper Structure

This paper contains 20 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Frequency analysis on natural and synthetic videos. The leftmost column shows the averaged FFT spectra of two distinct natural videos, while the right three columns (from left to right) display the averaged FFT spectra of the corresponding synthetic videos generated by RAD-NeRF, DINet, and SadTalker, respectively. The two rows represent data from talking videos of two distinct identities.
  • Figure 2: The framework of our method. The model takes a single reference image, an audio clip, and a masked image (which can be either a half mask or a full mask) as inputs, and produces a frame that is synchronized with the driving audio. The lower half of the figure presents the detailed structure of the two frequency-domain modulation modules.
  • Figure 3: Qualitative Comparisons with State-of-the-Art Methods. Three identities, each with different speech content, are compared against Wav2Lip, DINet, IP-LAP, MuseTalk, RAD-NeRF and SadTalker. The results of SadTalker are used for comparison with those obtained by our method in the one-shot setting.
  • Figure 4: Results without both VEFM & AVFM in the one-shot setting. The consistency of the head structure between frames is difficult to maintain, with noticeable jitter between frames. Some frames exhibit ghosting or distortion.
  • Figure 5: The impact of the context length of HuBERT features on the synthesized results.
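The averaged FFT spectrum described in the Figure 1 caption can be approximated as follows. This is a minimal sketch under stated assumptions, not the authors' analysis script: frames are assumed to be grayscale arrays of shape (T, H, W), the log-amplitude scaling is a common visualization choice rather than something the caption specifies, and `averaged_log_spectrum` and `spectrum_gap` are hypothetical helper names.

```python
import numpy as np


def averaged_log_spectrum(frames):
    """Mean log-amplitude 2D FFT spectrum over frames of shape (T, H, W).

    The zero-frequency (DC) component is shifted to the center of the
    output, as is usual when displaying spectra.
    """
    frames = np.asarray(frames, dtype=np.float64)
    spectra = np.fft.fft2(frames, axes=(1, 2))
    spectra = np.fft.fftshift(spectra, axes=(1, 2))
    return np.log1p(np.abs(spectra)).mean(axis=0)


def spectrum_gap(natural_frames, synthetic_frames):
    """Mean absolute difference between two averaged log spectra."""
    diff = averaged_log_spectrum(natural_frames) - averaged_log_spectrum(synthetic_frames)
    return float(np.mean(np.abs(diff)))


# Usage: random frames stand in for a natural video; a box-blurred copy
# stands in for a synthetic video that has lost high-frequency detail.
rng = np.random.default_rng(0)
natural = rng.random((8, 16, 16))
blurred = (natural + np.roll(natural, 1, axis=1) + np.roll(natural, 1, axis=2)) / 3.0
avg_spec = averaged_log_spectrum(natural)
gap = spectrum_gap(natural, blurred)
```

A larger `gap` indicates that the two videos differ more in their average frequency content; in this toy setup the blur suppresses high frequencies, so the gap is nonzero.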