PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation

Fatemeh Nazarieh, Zhenhua Feng, Diptesh Kanojia, Muhammad Awais, Josef Kittler

TL;DR

PortraitTalk introduces a customizable one-shot audio-to-talking-face framework built on a latent diffusion backbone, comprising IdentityNet for identity preservation and text-driven editing, and AnimateNet for motion generation with structure, identity, and temporal cross-attention. It integrates audio, visual, and textual cues via decoupled cross-attention and employs a mask reconstruction loss to strengthen global facial coherence, while a two-stage training regime ensures robust generalization to unseen identities. A novel Audio-Driven Facial Dynamics (ADFD) score jointly evaluates spatial and temporal facial dynamics aligned with audio, enabling holistic assessment. Empirical results on HDTF and MEAD show PortraitTalk surpassing state-of-the-art methods in visual fidelity, lip-sync accuracy, and identity consistency, with flexible prompt-based customization and multi-reference support. The approach advances real-world applicability for customizable, expressive talking-face content without identity retraining, while acknowledging limitations under intense emotional expressions and potential style-related artifacts.

Abstract

Audio-driven talking face generation is a challenging task in digital communication. Despite significant progress in the area, most existing methods concentrate on audio-lip synchronization, often overlooking aspects such as visual quality, customization, and generalization that are crucial to producing realistic talking faces. To address these limitations, we introduce a novel, customizable one-shot audio-driven talking face generation framework, named PortraitTalk. Our proposed method utilizes a latent diffusion framework consisting of two main components: IdentityNet and AnimateNet. IdentityNet is designed to preserve identity features consistently across the generated video frames, while AnimateNet aims to enhance temporal coherence and motion consistency. This framework also integrates an audio input with the reference images, thereby reducing the reliance on reference-style videos prevalent in existing approaches. A key innovation of PortraitTalk is the incorporation of text prompts through decoupled cross-attention mechanisms, which significantly expands creative control over the generated videos. Through extensive experiments, including a newly developed evaluation metric, our model demonstrates superior performance over the state-of-the-art methods, setting a new standard for the generation of customizable realistic talking faces suitable for real-world applications.
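The decoupled cross-attention mentioned above can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so this assumes the common additive form in which the latent queries attend to text tokens and identity tokens through separate key/value projections and the two results are summed. The names `decoupled_cross_attention` and `id_scale`, the weight dictionary `w`, and all dimensions are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention for a single head.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def decoupled_cross_attention(latent, text_emb, id_emb, w, id_scale=1.0):
    # Assumed form: shared queries, separate key/value projections per
    # condition, outputs added together. id_scale is a hypothetical knob
    # weighting the identity branch.
    q = latent @ w["q"]
    text_out = attention(q, text_emb @ w["k_text"], text_emb @ w["v_text"])
    id_out = attention(q, id_emb @ w["k_id"], id_emb @ w["v_id"])
    return text_out + id_scale * id_out

# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
d_latent, d_ctx, d_attn = 8, 6, 4
w = {
    "q": rng.normal(size=(d_latent, d_attn)),
    "k_text": rng.normal(size=(d_ctx, d_attn)),
    "v_text": rng.normal(size=(d_ctx, d_attn)),
    "k_id": rng.normal(size=(d_ctx, d_attn)),
    "v_id": rng.normal(size=(d_ctx, d_attn)),
}
latent = rng.normal(size=(5, d_latent))   # 5 latent tokens
text_emb = rng.normal(size=(3, d_ctx))    # 3 text tokens
id_emb = rng.normal(size=(2, d_ctx))      # 2 identity tokens (already projected)
out = decoupled_cross_attention(latent, text_emb, id_emb, w)
```

Keeping the two branches separate means the identity signal can be scaled or switched off (`id_scale=0.0`) without touching the text pathway, which is one reason this style of conditioning suits prompt-based editing.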

Paper Structure

This paper contains 19 sections, 4 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Talking faces generated by PortraitTalk. Given the reference images of an identity and the corresponding audio input, PortraitTalk synthesizes high-quality talking face videos that closely preserve the identity's appearance and speaking style. Furthermore, the visual attributes of the generated video, such as hair color, age, environmental settings and facial expressions, can be flexibly customized using a simple text prompt. This enables fine-grained control over the generated content, allowing users to tailor the visual presentation to specific narrative or stylistic requirements.
  • Figure 2: PortraitTalk has two main components: IdentityNet and AnimateNet. Text and identity embeddings are derived from the text and face encoder, with a projection layer mapping identity features into the text embedding dimension. These features are integrated into IdentityNet using a decoupled cross-attention mechanism to capture subtle facial characteristics. Simultaneously, facial motions corresponding to the input speech, enhanced by head placement guidance, are processed through AnimateNet to ensure dynamic and temporal coherence. In PortraitTalk, a latent diffusion model serves as the foundational rendering mechanism. The structural attention block incorporates head placement guidance and facial landmark mapping. For simplicity, these elements are represented within a single block.
  • Figure 3: An overview of the masked loss fine-tuning strategy used for IdentityNet. During training, random regions of the input frames are corrupted, and the model is optimized to reconstruct the original content. This masked fine-tuning approach encourages the network to focus on the global facial structure and identity-relevant features, rather than overfitting to local pixel-level details. The process operates in the latent space of a diffusion model: the masked input is first encoded, followed by a forward and backward diffusion process over multiple time steps ($T_{steps}$). The denoising U-Net is trained to estimate the added noise using a mask reconstruction loss ($L_{mask}$), which guides IdentityNet toward producing more consistent, stable, and identity-preserving generations.
  • Figure 4: Qualitative comparison with existing talking face generation methods. The results demonstrate that PortraitTalk surpasses previous methods in audio-lip alignment, identity resemblance, and expressiveness. Note that the methods in the dashed box use external emotion labels or reference videos to generate expressive videos.
  • Figure 5: Qualitative comparison of ablated variants of PortraitTalk, illustrating the effect of different model components. Each column shows output frames generated by models with specific components removed or added. These visualizations demonstrate the contribution of each design choice to the final visual quality and identity consistency of the generated talking head videos. This ablation study highlights how each component contributes to enhancing realism, expressiveness, and alignment with user intent.
  • ...and 4 more figures
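
The masked fine-tuning described in the Figure 3 caption can be sketched roughly as follows. This is only one plausible reading: the caption does not define $L_{mask}$ exactly, so the sketch uses a plain noise-prediction mean squared error on a latent with one random rectangle zeroed out, combined with the standard forward-diffusion step. The functions `random_box_mask` and `masked_noise_loss`, the `denoiser` callable, and the `alpha_bar_t` value are all hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

def random_box_mask(h, w, rng, max_frac=0.5):
    # Binary mask with one random rectangle zeroed (1 = keep, 0 = corrupted).
    bh = rng.integers(1, max(2, int(h * max_frac) + 1))
    bw = rng.integers(1, max(2, int(w * max_frac) + 1))
    top = rng.integers(0, h - bh + 1)
    left = rng.integers(0, w - bw + 1)
    mask = np.ones((h, w))
    mask[top:top + bh, left:left + bw] = 0.0
    return mask

def masked_noise_loss(latent, denoiser, alpha_bar_t, rng):
    # latent: (channels, height, width) array standing in for an encoded frame.
    _, h, w = latent.shape
    mask = random_box_mask(h, w, rng)
    corrupted = latent * mask  # mask broadcasts over channels
    noise = rng.standard_normal(latent.shape)
    # Standard forward-diffusion step at a single timestep.
    noisy = np.sqrt(alpha_bar_t) * corrupted + np.sqrt(1.0 - alpha_bar_t) * noise
    pred = denoiser(noisy)  # a real model would be a U-Net; here any callable
    loss = float(np.mean((pred - noise) ** 2))
    return loss, mask

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8, 8))
# Placeholder denoiser that always predicts zero noise.
loss, mask = masked_noise_loss(latent, lambda z: np.zeros_like(z), alpha_bar_t=0.5, rng=rng)
```

In an actual training loop the placeholder denoiser would be replaced by the trainable network and the loss would be averaged over sampled timesteps and frames.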