Efficient Long-duration Talking Video Synthesis with Linear Diffusion Transformer under Multimodal Guidance

Haojie Zhang; Zhihao Liang; Ruibo Fu; Bingyan Liu; Zhengqi Wen; Xuefei Liu; Jianhua Tao; Yaling Liang

TL;DR

Long-duration talking head synthesis suffers from quality degradation, identity drift, and inefficient generation. The authors present LetsTalk, a diffusion-transformer framework that uses a noise-regularized memory bank, a deep compression autoencoder, and linear attention to preserve temporal coherence and scalability under multimodal guidance. They systematically analyze multimodal fusion schemes and find that Symbiotic Fusion for portraits combined with Direct Fusion for audio yields strong identity preservation and synchronized motion while maintaining diversity. Empirical results on HDTF and CelebV-HQ demonstrate state-of-the-art realism and temporal consistency with 8× fewer parameters than previous approaches and improved efficiency, enabling practical, scalable digital-human applications.

Abstract

Long-duration talking video synthesis faces enduring challenges in achieving high video quality, portrait and temporal consistency, and computational efficiency. As video length increases, issues such as visual degradation, identity inconsistency, temporal incoherence, and error accumulation become increasingly problematic, severely affecting the realism and reliability of the results. To address these challenges, we present LetsTalk, a diffusion transformer framework equipped with multimodal guidance and a novel memory bank mechanism, explicitly maintaining contextual continuity and enabling robust, high-quality, and efficient generation of long-duration talking videos. In particular, LetsTalk introduces a noise-regularized memory bank to alleviate error accumulation and sampling artifacts during extended video generation. To further improve efficiency and spatiotemporal consistency, LetsTalk employs a deep compression autoencoder and a spatiotemporal-aware transformer with linear attention for effective multimodal fusion. We systematically analyze three fusion schemes and show that combining deep fusion (Symbiotic Fusion) for portrait features with shallow fusion (Direct Fusion) for audio achieves superior visual realism and precise speech-driven motion, while preserving diversity of movements. Extensive experiments demonstrate that LetsTalk establishes a new state of the art in generation quality, producing temporally coherent and realistic talking videos with enhanced diversity and liveliness, and maintains remarkable efficiency with 8× fewer parameters than previous approaches.
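
The abstract names linear attention as one source of efficiency but does not say which formulation is used. As a rough illustration only, the sketch below shows the generic kernelized form of linear attention, in which the key-value summary is computed once and reused for every query, so cost grows linearly with the number of tokens instead of quadratically. The ReLU-based `feature_map` and the function names `quadratic_attention` and `linear_attention` are assumptions made for this example and are not taken from LetsTalk.

```python
# Generic kernelized linear attention (illustrative; not the LetsTalk implementation).

def feature_map(x):
    # Assumed positive feature map: ReLU plus a small constant so the
    # normalizer below can never be zero.
    return [max(v, 0.0) + 1e-6 for v in x]


def quadratic_attention(Q, K, V):
    # Reference form: computes all N x N similarity scores explicitly.
    out = []
    for q in Q:
        fq = feature_map(q)
        weights = [sum(a * b for a, b in zip(fq, feature_map(k))) for k in K]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    # Reassociated form: summarize keys and values once, then reuse the
    # summary for every query.
    d, dv = len(K[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]  # sum of phi(k) v^T, shape d x dv
    k_sum = [0.0] * d                    # sum of phi(k), shape d
    for k, v in zip(K, V):
        fk = feature_map(k)
        for i in range(d):
            k_sum[i] += fk[i]
            for j in range(dv):
                kv[i][j] += fk[i] * v[j]
    out = []
    for q in Q:
        fq = feature_map(q)
        norm = sum(a * b for a, b in zip(fq, k_sum))
        out.append([sum(fq[i] * kv[i][j] for i in range(d)) / norm
                    for j in range(dv)])
    return out
```

Both functions compute the same result up to floating-point rounding; the second never materializes the N × N score matrix, which is what makes this family of attention attractive for long token sequences such as long-duration video.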
