See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement
Jinting Wang, Jun Wang, Hei Victor Cheng, Li Liu
TL;DR
The paper addresses generating high-resolution talking-face videos from a single speech input, with no reference portrait. Without a source image, both identity preservation and realistic lip synchronization become harder. It introduces a two-stage framework. The first stage, SCFP, uses a speech-conditioned latent diffusion model guided by a statistical face prior and a sample-adaptive weighting (SAW) module to produce a consistent speaker portrait. The second stage, HRTF, models holistic motion in a latent space and refines lip movements before rendering, aided by a Transformer-based discrete codebook for high-fidelity frames. Key components include ConRe pre-training for cross-modal alignment, SAW to adapt the face prior to each speech sample, a lip-region refinement module, and end-to-end high-resolution rendering with a learned codebook. The authors report that the approach outperforms existing methods on HDTF, VoxCeleb, and AVSpeech in identity preservation, lip synchronization, and visual fidelity, which suggests practical use for generating realistic virtual personas from audio alone.
Abstract
Unlike existing methods that rely on source images as appearance references and use source speech to generate motion, this work proposes a novel approach that extracts both appearance and motion information directly from the speech, addressing key challenges in speech-to-talking-face generation. Specifically, we first employ a speech-to-face portrait generation stage, utilizing a speech-conditioned diffusion model combined with a statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. In the subsequent speech-driven talking face generation stage, we embed expressive dynamics such as lip movements, facial expressions, and eye movements into the latent space of the diffusion model and further optimize lip synchronization using a region-enhancement module. To generate high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering network, enhancing video frame details in an end-to-end manner. Experimental results demonstrate that our method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets. Notably, this is the first method capable of generating high-resolution, high-quality talking face videos exclusively from a single speech input.
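The abstract does not specify how the sample-adaptive weighting module combines the statistical facial prior with the speech-conditioned signal. As a rough illustration only, the sketch below assumes the simplest plausible form: a scalar gate computed from the speech embedding that blends a prior latent with a speech-conditioned latent. The function name `sample_adaptive_blend`, the parameters `gate_weights` and `gate_bias`, and the linear-plus-sigmoid gate are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_adaptive_blend(
    speech_emb: Sequence[float],
    prior_latent: Sequence[float],
    cond_latent: Sequence[float],
    gate_weights: Sequence[float],
    gate_bias: float = 0.0,
) -> Tuple[List[float], float]:
    """Blend a face-prior latent with a speech-conditioned latent.

    Hypothetical gate (an assumption, not the paper's design):
        alpha = sigmoid(gate_weights . speech_emb + gate_bias)
        out   = alpha * prior_latent + (1 - alpha) * cond_latent
    Returns the blended latent and the per-sample weight alpha.
    """
    if len(speech_emb) != len(gate_weights):
        raise ValueError("speech_emb and gate_weights must have equal length")
    if len(prior_latent) != len(cond_latent):
        raise ValueError("prior_latent and cond_latent must have equal length")

    # One scalar weight per speech sample, so each sample leans on the
    # prior to a different degree.
    score = sum(s * w for s, w in zip(speech_emb, gate_weights)) + gate_bias
    alpha = sigmoid(score)

    blended = [alpha * p + (1.0 - alpha) * c
               for p, c in zip(prior_latent, cond_latent)]
    return blended, alpha


if __name__ == "__main__":
    blended, alpha = sample_adaptive_blend(
        speech_emb=[1.0, -2.0],
        prior_latent=[0.0, 2.0],
        cond_latent=[2.0, 4.0],
        gate_weights=[0.0, 0.0],  # zero gate: prior and condition weigh equally
    )
    print(alpha, blended)
```

In this toy form, a weight near 1 falls back on the prior (useful when the speech carries little identity information), while a weight near 0 trusts the speech-conditioned latent. A real module would most likely learn the gate jointly with the diffusion model and could use per-channel or per-region weights instead of a single scalar.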
