Table of Contents
Fetching ...

InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing

Shaoshu Yang, Zhe Kong, Feng Gao, Meng Cheng, Xiangyu Liu, Yong Zhang, Zhuoliang Kang, Wenhan Luo, Xunliang Cai, Ran He, Xiaoming Wei

TL;DR

<3-5 sentence high-level summary>

Abstract

Recent breakthroughs in video AIGC have ushered in a transformative era for audio-driven human animation. However, conventional video dubbing techniques remain constrained to mouth region editing, resulting in discordant facial expressions and body gestures that compromise viewer immersion. To overcome this limitation, we introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference keyframes to maintain identity, iconic gestures, and camera trajectories while enabling holistic, audio-synchronized full-body motion editing. Through critical analysis, we identify why naive image-to-video models fail in this task, particularly their inability to achieve adaptive conditioning. Addressing this, we propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long sequence dubbing. This architecture leverages temporal context frames for seamless inter-chunk transitions and incorporates a simple yet effective sampling strategy that optimizes control strength via fine-grained reference frame positioning. Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance. Quantitative metrics confirm superior visual realism, emotional coherence, and full-body motion synchronization.

InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing

TL;DR

<3-5 sentence high-level summary>

Abstract

Recent breakthroughs in video AIGC have ushered in a transformative era for audio-driven human animation. However, conventional video dubbing techniques remain constrained to mouth region editing, resulting in discordant facial expressions and body gestures that compromise viewer immersion. To overcome this limitation, we introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference keyframes to maintain identity, iconic gestures, and camera trajectories while enabling holistic, audio-synchronized full-body motion editing. Through critical analysis, we identify why naive image-to-video models fail in this task, particularly their inability to achieve adaptive conditioning. Addressing this, we propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long sequence dubbing. This architecture leverages temporal context frames for seamless inter-chunk transitions and incorporates a simple yet effective sampling strategy that optimizes control strength via fine-grained reference frame positioning. Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance. Quantitative metrics confirm superior visual realism, emotional coherence, and full-body motion synchronization.

Paper Structure

This paper contains 22 sections, 4 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Compared to the traditional paradigm, sparse-frame video dubbing will not only edit mouth regions. It gives the model freedom to generate audio aligned mouth, facial, and body movements while referencing on sparse keyframes to preserve identity, emotional cadence, and iconic gestures.
  • Figure 2: (left): I2V model accumulates error for long video sequences. (right): A new chunk starts from frame 82. FL2V model suffers from abrupt inter-chunk transitions.
  • Figure 3: A visual comparison between the training reference positioning strategies. All video chunks are generated using the same context frames and the same reference frame shown in below.
  • Figure 4: Visualization of InfiniteTalk pipeline. Left: The streaming model receives a audio, a reference frame, and context frames to denoise iteratively. Right: The architecture of the diffusion transformer. In addition to the traditional structures, each block includes an audio cross-attention layer and a reference cross-attention layer
  • Figure 5: Visualization of reference frame conditioning strategies for video dubbing models. Top four rows: conditioning on input video frames. Bottom row: conditioning on generated video frames. Left: Image-to-video dubbing model with initial frame conditioning (I2V) and initial+terminal frame conditioning (IT2V). Right: Streaming dubbing model with four conditioning strategies. Within each category (left/right), all strategies share identical generated-video conditioning approaches.
  • ...and 2 more figures