IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer

Bo Chen, Tao Liu, Qi Chen, Xie Chen, Zilong Zheng

TL;DR

IMTalker targets efficient, high-fidelity talking-face generation by replacing explicit optical-flow warping with implicit motion transfer based on cross-attention in a unified latent space. An Identity-Adaptive module disentangles motion from identity, while a lightweight Flow-Matching Motion Generator produces controllable motion latents from audio, pose, and gaze cues. The method reports state-of-the-art motion accuracy, identity preservation, and audio-lip synchronization at real-time speed (40 FPS video-driven, 42 FPS audio-driven on an RTX 4090). Experiments on HDTF, CelebV, and cross-reenactment tasks show robust handling of large pose variations and diverse identities, and the authors state that code and pretrained models will be released.

Abstract

Talking face generation aims to synthesize realistic speaking portraits from a single image, yet existing methods often rely on explicit optical flow and local warping, which fail to model complex global motions and cause identity drift. We present IMTalker, a novel framework that achieves efficient and high-fidelity talking face generation through implicit motion transfer. The core idea is to replace traditional flow-based warping with a cross-attention mechanism that implicitly models motion discrepancy and identity alignment within a unified latent space, enabling robust global motion rendering. To further preserve speaker identity during cross-identity reenactment, we introduce an identity-adaptive module that projects motion latents into personalized spaces, ensuring clear disentanglement between motion and identity. In addition, a lightweight flow-matching motion generator produces vivid and controllable implicit motion vectors from audio, pose, and gaze cues. Extensive experiments demonstrate that IMTalker surpasses prior methods in motion accuracy, identity preservation, and audio-lip synchronization, achieving state-of-the-art quality with superior efficiency, operating at 40 FPS for video-driven and 42 FPS for audio-driven generation on an RTX 4090 GPU. We will release our code and pre-trained models to facilitate applications and future research.
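
To make the cross-attention idea in the abstract concrete, here is a minimal NumPy sketch of scaled dot-product cross-attention. It is an illustration only, not the authors' implementation: the token count, feature width, and the choice of driving-motion features as queries, source-motion features as keys, and identity features as values are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: (n_q, d), keys: (n_k, d), values: (n_k, d_v)
    Returns the attended features (n_q, d_v) and the weights (n_q, n_k).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

# Hypothetical shapes and role assignment (assumptions, not from the paper).
rng = np.random.default_rng(0)
n_tokens, d = 16, 32
driving_motion = rng.standard_normal((n_tokens, d))  # queries
source_motion = rng.standard_normal((n_tokens, d))   # keys
identity_feats = rng.standard_normal((n_tokens, d))  # values

aligned, weights = cross_attention(driving_motion, source_motion, identity_feats)
```

Because every output token is a weighted mixture over all source tokens, attention of this kind can move appearance information across the whole feature map, whereas flow-based warping displaces features locally. This is the intuition behind the abstract's claim about global motion.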

Paper Structure

This paper contains 29 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: IMTalker accepts diverse portrait styles and achieves 40 FPS for video-driven and 42 FPS for audio-driven talking-face generation when tested on an NVIDIA RTX 4090 GPU at $512\times512$ resolution. It also enables diverse controllability by allowing precise head-pose and eye-gaze inputs alongside audio.
  • Figure 2: Overall architecture of our proposed framework. Given a source image, an Identity Encoder and Motion Encoder extract identity features and source motion. A driving motion latent is either extracted from video by the Motion Encoder or synthesized from audio by a Flow-Matching Motion Generator. Both motion latents are then personalized by the Identity-Adaptive Module. Subsequently, the Implicit Motion Transfer Module uses these personalized latents and $f_{id}$ to generate aligned features via a motion decoder and cross-attention. Finally, the Synthesis Network renders these aligned features into the final photorealistic image.
  • Figure 3: Qualitative comparison on cross-reenactment. Our method (IMTalker) achieves superior identity preservation and visual realism compared to existing baselines. Please refer to our supplementary material for a more detailed comparison.
  • Figure 4: Qualitative comparison of audio-driven talking head generation results. The Source column (center) shows the single source identity used for all methods on HDTF [zhang2021flow] (top image, left block) and CelebV [yu2023celebv] (bottom image, right block). Each row corresponds to one method's results on different test samples. Please refer to our supplementary material for a more detailed comparison.
  • Figure 5: Ablation study on the Identity-Adaptive module. From left to right: driving frame, source images, results of our model without the identity-adaptive module, and the full model results.
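
The Flow-Matching Motion Generator named in Figure 2 synthesizes a motion latent from audio. A generic way to sample from a flow-matching model is to integrate a velocity field from noise at t = 0 to data at t = 1. The sketch below uses plain Euler steps with a toy analytic velocity field standing in for the learned network. The function names, the step count, and the use of the target itself as the conditioning signal are hypothetical choices for illustration and do not come from the paper.

```python
import numpy as np

def toy_velocity(x, t, cond):
    """Stand-in for a learned velocity network v(x, t | cond).

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity
    that carries x_t to the endpoint x_1 is (x_1 - x_t) / (1 - t). Here the
    conditioning vector plays the role of x_1; a real model would not know
    x_1 and would have to predict the velocity from its conditioning inputs.
    """
    return (cond - x) / (1.0 - t)

def sample_euler(velocity_fn, cond, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t | cond) from t = 0 to t = 1 with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(cond.shape)  # start from Gaussian noise at t = 0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                       # t stays strictly below 1
        x = x + dt * velocity_fn(x, t, cond)
    return x

# Hypothetical 8-dimensional "motion latent" target used as the condition.
target = np.linspace(-1.0, 1.0, 8)
motion_latent = sample_euler(toy_velocity, target)
```

With this toy field the final Euler step lands on the target up to floating-point error, which makes the sketch easy to check. In a trained generator the velocity comes from a network, and the number of integration steps trades sampling speed against accuracy.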