MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation

Longtao Zheng, Yifan Zhang, Hanzhong Guo, Jiachun Pan, Zhenxiong Tan, Jiahao Lu, Chuanxin Tang, Bo An, Shuicheng Yan

TL;DR

MEMO tackles audio-driven talking video generation by addressing long-term identity preservation and natural expression alignment with audio. It introduces a memory-guided temporal module that uses linear attention and a memory decay mechanism to leverage extended past context, along with an emotion-aware audio module that uses multi-modal attention and emotion-adaptive layer normalization. A two-stage training regime and a dedicated data pipeline provide high-quality, emotion-disentangled training data and support robust performance. Empirical results on out-of-distribution datasets show that MEMO outperforms state-of-the-art methods in video quality, lip-sync, identity consistency, and expression-emotion alignment, with strong generalization to multilingual audio and diverse reference images.

Abstract

Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, achieving seamless audio-lip synchronization, maintaining long-term identity consistency, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach to generate identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by developing memory states to store information from a longer past context to guide temporal modeling via linear attention; and (2) an emotion-aware audio module, which replaces traditional cross attention with multi-modal attention to enhance audio-video interaction, while detecting emotions from audio to refine facial expressions via emotion adaptive layer norm. Extensive quantitative and qualitative results demonstrate that MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
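The memory-guided temporal module described above can be illustrated with a small sketch. This is not the authors' implementation: it only shows, under stated assumptions, how a fixed-size memory state can summarize a long past context for linear attention while older context fades. The class name `DecayingMemory`, the `elu(x) + 1` feature map, and the scalar `decay` factor are all hypothetical choices made for this illustration.

```python
import math


def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class DecayingMemory:
    """Fixed-size memory for linear attention with exponential decay.

    Hypothetical illustration only; names and the decay form are assumptions.
    """

    def __init__(self, dim_k, dim_v, decay=0.9):
        self.decay = decay
        # S accumulates sum of phi(k) v^T; z accumulates sum of phi(k).
        self.S = [[0.0] * dim_v for _ in range(dim_k)]
        self.z = [0.0] * dim_k

    def update(self, keys, values):
        """Fold a new chunk of past frames into memory, fading older chunks."""
        self.S = [[self.decay * s for s in row] for row in self.S]
        self.z = [self.decay * s for s in self.z]
        for k, v in zip(keys, values):
            fk = feature_map(k)
            for i, fki in enumerate(fk):
                self.z[i] += fki
                for j, vj in enumerate(v):
                    self.S[i][j] += fki * vj

    def read(self, query):
        """Attend to the whole remembered context at a cost independent of its length."""
        fq = feature_map(query)
        dim_v = len(self.S[0])
        denom = sum(a * b for a, b in zip(fq, self.z))
        if denom == 0.0:
            return [0.0] * dim_v  # empty memory: nothing to attend to
        return [
            sum(fq[i] * self.S[i][j] for i in range(len(fq))) / denom
            for j in range(dim_v)
        ]
```

Because the state has a fixed size, each `read` costs the same regardless of how many past frames have been folded in, and the decay factor weights recent chunks more heavily than distant ones.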

Paper Structure

This paper contains 35 sections, 6 equations, 19 figures, 3 tables.

Figures (19)

  • Figure 1: Our MEMO generates talking videos with improved identity consistency, audio-lip alignment, and motion smoothness. In contrast, existing diffusion methods (e.g., Hallo2 [cui2024hallo2]) are prone to temporal error accumulation during autoregressive generation, especially when the last 2-4 generated frames used as temporal conditions contain artifacts, leading to inconsistent identity. Please refer to the supplementary material for video demos.
  • Figure 2: Overview of MEMO, which is structured with a Reference Net and a Diffusion Net. The core innovations of MEMO reside in two key modules within the Diffusion Net: the memory-guided temporal module and the emotion-aware audio module. These modules work in tandem to deliver enhanced audio-video synchronization, sustained identity consistency, and more natural expression generation.
  • Figure 3: Memory-guided temporal module.
  • Figure 4: Emotion-aware audio module.
  • Figure 5: Human preferences among MEMO and baselines, where users select the best method in terms of each evaluation metric.
  • ...and 14 more figures