See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement

Jinting Wang, Jun Wang, Hei Victor Cheng, Li Liu

TL;DR

The paper addresses generating high-resolution talking-face videos from a single speech input, with no reference portrait. The main difficulties are preserving identity and achieving realistic lip-sync. It introduces a two-stage framework. The first stage, SCFP, uses a speech-conditioned latent diffusion model guided by a statistical face prior and a sample-adaptive weighting (SAW) module to produce a consistent speaker portrait. The second stage, HRTF, models holistic motion in a latent space and refines lip movements before rendering, aided by a Transformer-based discrete codebook for high-fidelity frames. Key innovations include ConRe pre-training for cross-modal alignment, SAW to adapt the prior to each speech sample, a lip-region refinement module, and end-to-end high-resolution rendering with a learned codebook. The authors report that the approach outperforms state-of-the-art methods on HDTF, VoxCeleb, and AVSpeech in identity preservation, lip synchronization, and visual fidelity, which suggests it could be used to build realistic virtual personas from audio alone.

Abstract

Unlike existing methods that rely on source images as appearance references and use source speech to generate motion, this work proposes a novel approach that directly extracts information from the speech, addressing key challenges in speech-to-talking face. Specifically, we first employ a speech-to-face portrait generation stage, utilizing a speech-conditioned diffusion model combined with statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. In the subsequent speech-driven talking face generation stage, we embed expressive dynamics such as lip movement, facial expressions, and eye movements into the latent space of the diffusion model and further optimize lip synchronization using a region-enhancement module. To generate high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering network, enhancing video frame details in an end-to-end manner. Experimental results demonstrate that our method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets. Notably, this is the first method capable of generating high-resolution, high-quality talking face videos exclusively from a single speech input.
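As a rough illustration of what "a statistical facial prior and a sample-adaptive weighting module" could mean in practice, the sketch below blends a fixed face prior into the diffusion model's initial noise using a per-sample weight predicted from a speech embedding. This is not the authors' implementation: the function names, the logistic weighting, the convex-combination blend, and all tensor shapes are assumptions made for illustration only.

```python
import numpy as np

def sample_adaptive_weight(speech_emb, w, b):
    """Hypothetical weighting head: maps each speech embedding (B, D)
    to a scalar prior weight in (0, 1) with a logistic unit."""
    return 1.0 / (1.0 + np.exp(-(speech_emb @ w + b)))

def prior_guided_noise(noise, face_prior, beta):
    """Blend a shared face prior (C, H, W) into per-sample noise (B, C, H, W).
    beta = 0 keeps pure noise; beta = 1 returns the prior itself.
    The convex combination is an assumed form, not taken from the paper."""
    beta = np.asarray(beta, dtype=float).reshape(-1, 1, 1, 1)  # broadcast over C, H, W
    return (1.0 - beta) * noise + beta * face_prior

# Example usage with random placeholders for every input.
rng = np.random.default_rng(0)
noise = rng.standard_normal((2, 4, 8, 8))       # initial latent noise
face_prior = rng.standard_normal((4, 8, 8))     # stand-in for a statistical face prior
speech_emb = rng.standard_normal((2, 16))       # stand-in for speech embeddings
w, b = rng.standard_normal(16), 0.0             # hypothetical weighting parameters

beta = sample_adaptive_weight(speech_emb, w, b)
guided = prior_guided_noise(noise, face_prior, beta)
```

Setting the same `beta` for every sample would correspond to a sample-equivalent weighting; letting it depend on the speech embedding is what makes the weighting sample-adaptive in this sketch.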

Paper Structure

This paper contains 31 sections, 20 equations, 15 figures, 12 tables.

Figures (15)

  • Figure 1: Our framework enables high-resolution talking face video generation from a single speech clip. First, identity information is disentangled to synthesize the speaker's face portrait. Talking videos are then generated to align with the decoupled motion cues while maintaining identity consistency throughout the video. Notably, for aesthetic purposes and to ensure a fair comparison, we edit the generated face portraits by adding audio-unrelated attributes such as hair, clothing, and background.
  • Figure 2: Overview of the proposed two-stage high-resolution talking face generation framework. (1) Stage 1: Speech-Conditioned Portrait Generation with Face Prior Guidance (SCFP). In this stage, the portrait diffusion model $P_{diff}$ is trained to capture the personalized speech-portrait correlation using statistical face prior guidance. To emphasize the individual variance conditioned on the speech, we design a Sample-Adaptive Weighted (SAW) module that adaptively adjusts the face prior weight on the noise input. (2) Stage 2: High-Resolution Talking Face Synthesis with Holistic Motion and Lip Region Refinement (HRTF). Based on the speech condition, we develop a motion diffusion model $M_{diff}$ to capture the holistic motion representation, including both facial dynamics and head movement, in the latent space. Subsequently, a motion wrapping module and a high-resolution decoder render the learned motion into high-resolution talking face videos, preserving both the static and dynamic visual attributes of the target identity.
  • Figure 3: Qualitative comparison of speech-conditioned portrait generation without and with statistical face prior guidance. (a) Ground truth cropped from the video frame; (b) Top-3 generated results for the same speech condition without face prior guidance; (c) Top-3 generated results for the same speech condition with sample-equivalent weighted ($\beta^0$) face prior guidance; (d) Top-3 generated results for the same speech condition with sample-adaptive weighted ($\beta$) face prior guidance. Diversity refers to the variance among results generated from different noise samples under the same speech condition, while consistency denotes how well the generated results preserve the identity of the ground truth.
  • Figure 4: Details of the proposed Sample-Adaptive Weighted (SAW) module.
  • Figure 5: Details of holistic motion construction and wrapping. We train the identity encoder $\varnothing_{id}$, the motion encoder $\varnothing_{m}$, the Motion Wrapper, and the HR Decoder to learn the holistic motion representation and motion wrapping.
  • ...and 10 more figures