FREAK: Frequency-modulated High-fidelity and Real-time Audio-driven Talking Portrait Synthesis

Ziqi Ni; Ao Fu; Yi Zhou

TL;DR

FREAK tackles high-fidelity lip synchronization in audio-driven talking portrait synthesis by learning in the frequency domain. Two novel modules, the Visual Encoding Frequency Modulator (VEFM) and the Audio Visual Frequency Modulator (AVFM), modulate visual features and audio-visual interactions in the Fourier domain, and the model is optimized jointly in pixel and frequency space with $L_{rec}$, $L_{percep}$, and $L_{freq}$. The approach supports both one-shot generation and video dubbing, and delivers real-time, high-resolution results that surpass state-of-the-art methods in fidelity and lip-sync, as shown by quantitative and qualitative evaluations and user studies. This frequency-domain perspective offers a new direction for talking-head synthesis, with practical impact for digital avatars and media production; it also calls for attention to potential misuse and to deepfake detection tools.

Abstract

Achieving high-fidelity lip-speech synchronization in audio-driven talking portrait synthesis remains challenging. Multi-stage pipelines and diffusion models yield high-quality results but suffer from high computational costs. Some approaches perform well on specific individuals with low resource requirements, yet still exhibit mismatched lip movements. All of these methods model the task in the pixel domain. We observe noticeable discrepancies in the frequency domain between synthesized talking videos and natural videos, an aspect that no existing research on talking portrait synthesis has considered. To address this, we propose a FREquency-modulated, high-fidelity, and real-time Audio-driven talKing portrait synthesis framework, named FREAK, which models talking portraits from the frequency-domain perspective, enhancing the fidelity and naturalness of the synthesized portraits. FREAK introduces two novel frequency-based modules: 1) the Visual Encoding Frequency Modulator (VEFM), which couples multi-scale visual features in the frequency domain, better preserving visual frequency information and reducing the gap in the frequency spectrum between synthesized and natural frames; and 2) the Audio Visual Frequency Modulator (AVFM), which helps the model learn the talking pattern in the frequency domain and improves audio-visual synchronization. Additionally, we optimize the model jointly in the pixel domain and the frequency domain. FREAK also supports seamless switching between the one-shot and video dubbing settings, offering enhanced flexibility. Its performance allows it to support high-resolution video output and real-time inference at the same time. Extensive experiments demonstrate that our method synthesizes high-fidelity talking portraits with detailed facial textures and precise lip synchronization in real time, outperforming state-of-the-art methods.
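To make the idea of joint pixel-domain and frequency-domain optimization more concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the abstract does not give the exact form of the frequency loss, so the amplitude-spectrum L1 distance, the function names (`amplitude_spectrum`, `frequency_loss`, `pixel_loss`, `joint_loss`), and the weight `w_freq` are all assumptions made for illustration. The perceptual term is omitted because it would require a pretrained feature network.

```python
import numpy as np


def amplitude_spectrum(frame):
    """Centered 2D amplitude spectrum of an (H, W) grayscale frame."""
    spectrum = np.fft.fft2(frame, norm="ortho")
    return np.abs(np.fft.fftshift(spectrum))


def frequency_loss(pred, target):
    """Hypothetical frequency term: mean absolute gap between amplitude spectra."""
    gap = amplitude_spectrum(pred) - amplitude_spectrum(target)
    return float(np.mean(np.abs(gap)))


def pixel_loss(pred, target):
    """Pixel-domain reconstruction term: mean absolute error."""
    return float(np.mean(np.abs(pred - target)))


def joint_loss(pred, target, w_freq=1.0):
    """Assumed weighted sum of the pixel and frequency terms."""
    return pixel_loss(pred, target) + w_freq * frequency_loss(pred, target)


def box_blur(frame):
    """3x3 circular box blur, used here to mimic a frame that lost fine detail."""
    shifted = [
        np.roll(np.roll(frame, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
    ]
    return np.mean(shifted, axis=0)


rng = np.random.default_rng(0)
natural = rng.random((64, 64))   # stand-in for a natural frame
synthesized = box_blur(natural)  # stand-in for an over-smoothed synthesized frame

print("pixel loss:    ", pixel_loss(synthesized, natural))
print("frequency loss:", frequency_loss(synthesized, natural))
print("joint loss:    ", joint_loss(synthesized, natural, w_freq=0.5))
```

In this toy setup the blurred frame has attenuated high-frequency amplitudes, so the frequency term is nonzero and penalizes the spectral gap directly, in addition to the pixel-wise error.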
