MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice

Hongwei Yi, Tian Ye, Shitong Shao, Xuancheng Yang, Jiantong Zhao, Hanzhong Guo, Terrance Wang, Qingyu Yin, Zeke Xie, Lei Zhu, Wei Li, Michael Lingelbach, Daquan Zhou

TL;DR

MagicInfinite tackles infinite talking video generation from a single portrait, conditioned on audio and text, by combining a diffusion Transformer that uses 3D full attention with a sliding-window denoising strategy to ensure temporal coherence. It introduces a two-stage curriculum learning scheme that jointly leverages global textual guidance and local audio-driven lip dynamics, aided by a face-region mask and an adaptive loss to balance the modalities. Efficient long-video inference is achieved through collaborative step and CFG distillation (based on DMD2), with memory-efficient LoRA integration and sliding-window sampling, yielding about a $20\times$ speedup over the baseline. Evaluations on the MagicInfinite-Benchmark demonstrate robust lip sync, identity preservation, and motion naturalness across varied portraits and driving scenarios, making the approach practical for scalable, near-real-time avatar generation.

Abstract

We present MagicInfinite, a novel diffusion Transformer (DiT) framework that overcomes traditional portrait animation limitations, delivering high-fidelity results across diverse character types: realistic humans, full-body figures, and stylized anime characters. It supports varied facial poses, including back-facing views, and animates single or multiple characters, with input masks for precise speaker designation in multi-character scenes. Our approach tackles key challenges with three innovations: (1) 3D full-attention mechanisms with a sliding-window denoising strategy, enabling infinite video generation with temporal coherence and visual quality across diverse character styles; (2) a two-stage curriculum learning scheme, integrating audio for lip sync, text for expressive dynamics, and reference images for identity preservation, enabling flexible multi-modal control over long sequences; and (3) region-specific masks with adaptive loss functions to balance global textual control and local audio guidance, supporting speaker-specific animations. Efficiency is enhanced via our unified step and CFG distillation techniques, achieving a 20x inference speedup over the base model: generating a 10-second 540x540 video in 10 seconds, or a 720x720 video in 30 seconds, on 8 H100 GPUs, without quality loss. Evaluations on our new benchmark demonstrate MagicInfinite's superiority in audio-lip synchronization, identity preservation, and motion naturalness across diverse scenarios. It is publicly available at https://www.hedra.com/, with examples at https://magicinfinite.github.io/.
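
The abstract does not spell out how the sliding-window denoising strategy partitions a long sequence. As a rough illustration only, the sketch below shows one common way to cover an arbitrarily long frame sequence with fixed-size, overlapping windows. The function name `sliding_windows`, the `window` and `overlap` parameters, and the fixed-stride scheme are all assumptions made for this example, not details taken from the paper.

```python
def sliding_windows(num_frames: int, window: int, overlap: int) -> list[tuple[int, int]]:
    """Return half-open [start, end) frame ranges that cover num_frames.

    Hypothetical helper: consecutive windows share `overlap` frames so that
    each window can be denoised with context from its predecessor.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if num_frames <= 0:
        return []

    stride = window - overlap  # frames by which each new window advances
    ranges = []
    start = 0
    while True:
        end = min(start + window, num_frames)  # last window may be shorter
        ranges.append((start, end))
        if end >= num_frames:
            break
        start += stride
    return ranges


if __name__ == "__main__":
    # 10 frames, windows of 4 frames, 1 shared frame between neighbours.
    print(sliding_windows(10, 4, 1))
```

Because the number of windows grows linearly with `num_frames` while each window has a fixed size, per-window memory stays constant regardless of video length, which is the property that makes unbounded-length generation feasible in schemes of this kind.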

Paper Structure

This paper contains 21 sections, 8 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Given a portrait image, our model can generate compelling, realistic, and vivid animation videos with control over text and voice, ensuring temporal coherence and perceptual quality even under significant head pose variations and diverse portrait styles.
  • Figure 2: Overview of MagicInfinite. MagicInfinite employs a hybrid dual-to-single-stream denoising network with Audio Cross-Attention in final blocks. MLLM encodes static portrait and text into tokens, concatenated for T2V, refined, and denoised. Wav2Vec encodes audio, resampled by an Audio Encoder, and guided by a Face Region Mask for precise lip sync and adaptive loss.
  • Figure 3: The overview of our modified DMD2. We employed a curriculum learning strategy to gradually reduce the weight of the base loss while progressively increasing the weight of the SDS loss, effectively avoiding abrupt shifts in learning objectives. Furthermore, we adopted a two-fold to three-fold CFG attenuation strategy in the calculation of the real data distribution, which significantly enhances the motion dynamics of the generated video.
  • Figure 4: Qualitative experimental results of MagicInfinite.
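
The Figure 3 caption describes a curriculum that gradually lowers the weight of the base loss while raising the weight of the SDS loss. A minimal sketch of such a schedule is given below. The linear interpolation, the function name `curriculum_weights`, and the start and end values are assumptions chosen for illustration; the caption says only that the shift is gradual and does not give the actual schedule.

```python
def curriculum_weights(
    step: int,
    total_steps: int,
    base_start: float = 1.0,  # assumed initial base-loss weight
    base_end: float = 0.0,    # assumed final base-loss weight
    sds_start: float = 0.0,   # assumed initial SDS-loss weight
    sds_end: float = 1.0,     # assumed final SDS-loss weight
) -> tuple[float, float]:
    """Return (base_weight, sds_weight) for a training step.

    Hypothetical linear schedule: progress is clamped to [0, 1], so steps
    beyond total_steps keep the final weights.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    t = min(max(step / total_steps, 0.0), 1.0)
    base_w = base_start + (base_end - base_start) * t
    sds_w = sds_start + (sds_end - sds_start) * t
    return base_w, sds_w


if __name__ == "__main__":
    for step in (0, 50, 100):
        base_w, sds_w = curriculum_weights(step, 100)
        # The combined objective would be: base_w * base_loss + sds_w * sds_loss
        print(step, base_w, sds_w)
```

Blending the two weights smoothly, rather than switching objectives at a fixed step, is one way to avoid the abrupt shift in learning objective that the caption warns about.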