Efficient Long-duration Talking Video Synthesis with Linear Diffusion Transformer under Multimodal Guidance

Haojie Zhang, Zhihao Liang, Ruibo Fu, Bingyan Liu, Zhengqi Wen, Xuefei Liu, Jianhua Tao, Yaling Liang

TL;DR

Long-duration talking head synthesis suffers from quality degradation, identity drift, and inefficient generation. The authors present LetsTalk, a diffusion-transformer framework that uses a noise-regularized memory bank, a deep compression autoencoder, and linear attention to preserve temporal coherence and scalability under multimodal guidance. They systematically analyze multimodal fusion schemes and find that Symbiotic Fusion for portraits combined with Direct Fusion for audio yields strong identity preservation and synchronized motion while maintaining diversity. Empirical results on HDTF and CelebV-HQ demonstrate state-of-the-art realism and temporal consistency with 8× fewer parameters than previous approaches and improved efficiency, enabling practical, scalable digital-human applications.

Abstract

Long-duration talking video synthesis faces enduring challenges in achieving high video quality, portrait and temporal consistency, and computational efficiency. As video length increases, issues such as visual degradation, identity inconsistency, temporal incoherence, and error accumulation become increasingly problematic, severely affecting the realism and reliability of the results. To address these challenges, we present LetsTalk, a diffusion transformer framework equipped with multimodal guidance and a novel memory bank mechanism, explicitly maintaining contextual continuity and enabling robust, high-quality, and efficient generation of long-duration talking videos. In particular, LetsTalk introduces a noise-regularized memory bank to alleviate error accumulation and sampling artifacts during extended video generation. To further improve efficiency and spatiotemporal consistency, LetsTalk employs a deep compression autoencoder and a spatiotemporal-aware transformer with linear attention for effective multimodal fusion. We systematically analyze three fusion schemes and show that combining deep fusion (Symbiotic Fusion) for portrait features with shallow fusion (Direct Fusion) for audio achieves superior visual realism and precise speech-driven motion while preserving diversity of movements. Extensive experiments demonstrate that LetsTalk establishes a new state of the art in generation quality, producing temporally coherent and realistic talking videos with enhanced diversity and liveliness, and maintains remarkable efficiency with 8× fewer parameters than previous approaches.
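
The abstract credits part of the efficiency gain to linear attention. As a rough illustration of why that helps on long token sequences, the sketch below computes kernelized attention in O(n) rather than O(n²) in the sequence length by reordering the matrix products. The feature map `elu(x) + 1`, the function names, and the single-head, projection-free setup are assumptions made for illustration; the page does not state which linear-attention variant LetsTalk uses.

```python
import numpy as np

def feature_map(x):
    # Assumed kernel feature map: elu(x) + 1, which is strictly positive.
    # The source does not specify the feature map used by LetsTalk.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention with cost linear in the number of tokens.

    q, k: arrays of shape (n, d); v: array of shape (n, d_v).
    """
    q_f, k_f = feature_map(q), feature_map(k)
    kv = k_f.T @ v               # (d, d_v): summarize keys/values once
    k_sum = k_f.sum(axis=0)      # (d,): normalizer statistics
    numerator = q_f @ kv         # (n, d_v)
    denominator = q_f @ k_sum + eps  # (n,)
    return numerator / denominator[:, None]

def quadratic_reference(q, k, v, eps=1e-6):
    """Same result computed the O(n^2) way, for comparison only."""
    q_f, k_f = feature_map(q), feature_map(k)
    scores = q_f @ k_f.T         # (n, n) pairwise similarities
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)
```

Both functions return the same values up to floating-point error; the linear version simply never materializes the (n, n) score matrix, which is what makes long spatiotemporal sequences affordable.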

Paper Structure

This paper contains 20 sections, 1 equation, 8 figures, 7 tables.

Figures (8)

  • Figure 1: We introduce LetsTalk, a diffusion-based transformer for audio-driven portrait animation. Given a reference image and audio, LetsTalk generates realistic videos with synchronized mouth motions. In the left panel, each column corresponds to the same audio, demonstrating consistent and accurate lip movements. The right panel compares generation quality and inference time on the HDTF dataset, where circle size represents the number of model parameters. LetsTalk achieves superior quality and efficiency, outperforming methods like Hallo and AniPortrait. Notably, our base version (LetsTalk-B) matches Hallo's performance with 8$\times$ fewer parameters.
  • Figure 2: Overview of our LetsTalk framework for robust long-duration talking head video generation. Our system combines a deep compression autoencoder to reduce spatial redundancy while preserving temporal features, and transformer blocks with intertwined temporal and spatial attention to effectively capture both intra-frame details and long-range dependencies. Portrait and audio embeddings are extracted; Symbiotic Fusion integrates the portrait embedding, and Direct Fusion incorporates the audio embedding, providing effective multimodal guidance for video synthesis. Portrait embeddings are repeated along the temporal axis for consistent conditioning across frames. To further support long-sequence generation, a memory bank module is introduced to maintain temporal consistency, while a dedicated noise-regularized training strategy helps align the memory bank usage between training and inference stages, ensuring stable and high-fidelity generation.
  • Figure 3: Illustration of long-duration generation.
  • Figure 4: Multimodal fusion schemes: (a) Direct Fusion injects conditions via cross-attention modules; (b) Siamese Fusion uses a parallel transformer for feature guidance; (c) Symbiotic Fusion achieves fusion through input concatenation and self-attention. The backbone architecture (left-side blocks) remains consistent across all approaches.
  • Figure 5: Qualitative comparisons with other cutting-edge methods on the HDTF dataset. Our method achieves better audio-animation alignment (e.g., lip motions) and produces expressive results.
  • ...and 3 more figures
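
The Figure 4 caption distinguishes Direct Fusion (conditions injected via cross-attention) from Symbiotic Fusion (input concatenation followed by self-attention). A minimal shape-level sketch of that distinction follows. It is not the authors' implementation: the learned projections are omitted, concatenation along the token axis is an assumption (the caption does not say which axis is used), Siamese Fusion is left out because the caption gives too little detail to sketch it, and all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention without learned projections.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def direct_fusion(video_tokens, cond_tokens):
    # Cross-attention: queries come from the video tokens,
    # keys and values come from the condition (e.g. audio) tokens.
    return video_tokens + attention(video_tokens, cond_tokens, cond_tokens)

def symbiotic_fusion(video_tokens, cond_tokens):
    # Assumption: concatenate along the token axis, run self-attention
    # over the joint sequence, then keep only the video-token positions.
    joint = np.concatenate([video_tokens, cond_tokens], axis=0)
    fused = joint + attention(joint, joint, joint)
    return fused[: video_tokens.shape[0]]
```

In this reading, Direct Fusion lets the condition influence the video tokens without being updated itself, whereas Symbiotic Fusion lets condition and video tokens attend to each other in the same self-attention pass, which matches the abstract's description of it as the deeper of the two schemes.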