MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation

Siyuan Wang, Jiawei Liu, Wei Wang, Yeying Jin, Jinsong Du, Zhi Han

TL;DR

Generating a co-speech gesture video from a single still image and an audio clip is challenging because body regions differ in how much they move and how closely they follow the audio, and because lip movements must stay synchronized with speech. The paper introduces MMGT, a two-stage framework that uses motion masks and motion features derived from the audio to jointly drive gesture video synthesis without relying on extra priors. Stage I (SMGA) generates a pose video $\hat{V}_P$ and a motion mask $\hat{V}_M$ from audio $a$ and an initial pose, while Stage II embeds Motion Masked Hierarchical Audio Attention (MM-HAA) inside a Stabilized Diffusion framework to produce the final video $\hat{V}$ with accurate texture and regional details. Experiments show improved video quality, lip-sync, and gesture realism compared with prior methods, using only audio and a single reference image as input.

Abstract

Co-Speech Gesture Video Generation aims to generate vivid speech videos from audio-driven still images, which is challenging because different parts of the body vary in motion amplitude, audio relevance, and fine detail. Relying solely on audio as the control signal often fails to capture large gesture movements in video, leading to more pronounced artifacts and distortions. Existing approaches typically address this issue by introducing additional a priori information, but this can limit the practical application of the task. To address this, we propose a Motion Mask-Guided Two-Stage Network (MMGT) that uses audio, together with motion masks and motion features generated from the audio signal, to jointly drive the generation of synchronized speech gesture videos. In the first stage, the Spatial Mask-Guided Audio Pose Generation (SMGA) Network generates high-quality pose videos and motion masks from audio, effectively capturing large movements in key regions such as the face and hands. In the second stage, we integrate the Motion Masked Hierarchical Audio Attention (MM-HAA) into the Stabilized Diffusion Video Generation model, overcoming limitations in fine-grained motion generation and region-specific detail control found in traditional methods. This yields high-quality, detailed upper-body video generation with accurate texture and motion details. Evaluations show improved video quality, lip-sync, and gestures. The model and code are available at https://github.com/SIA-IDE/MMGT.

Paper Structure

This paper contains 3 sections, 3 figures.

Figures (3)

  • Figure 1: Overview of existing models for generating videos of co-speech gestures. Compared with other methods [meng2024echomimicv2, ruan2022mmdiffusion, lin2024cyberhost], our method can generate videos of co-speech gestures for specified characters, and the generation for specific regions does not require additional prior information.
  • Figure 2: Examples of our generated co-speech gesture videos. The lip shapes circled in red correspond to the letters shown in bold red.
  • Figure 3: Overview of the Proposed MMGT Framework. The framework operates in two stages: In Stage I, the SMGA network generates motion feature videos, including the pose video $\hat{V}_P$ and motion mask $\hat{V}_M$, based on the input audio $a$ and initial pose $p^{(0)}$. In Stage II, the Denoising UNet utilizes $Z_{\text{pose}}$ and $Z_{\text{text}}$, while the ReferenceNet integrates $Z_{\text{pose}}$, $Z_{\text{text}}$, and $Z_{\text{img}}$ to produce the final predicted video $\hat{V}$. On the right, the MM-HAA module enhances $\hat{V}_M$ by aligning audio features $f_a$ with cross-attention embeddings $Z_{\text{CA}}$. The green double dashed line indicates the inference process, where $V_P$ and $V_M$ are replaced by $\hat{V}_P$ and $\hat{V}_M$ when transitioning from training to inference.
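
The Figure 3 caption says the motion mask is combined with audio features $f_a$ and cross-attention embeddings $Z_{\text{CA}}$, but it does not spell out the mechanism. The sketch below is one generic way a mask could gate audio cross-attention, offered only as an illustration and not as the authors' MM-HAA module. It is an assumption that the mask acts as a per-token weight on a residual attention update; the function name, tensor shapes, and the absence of learned projections are all hypothetical simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_audio_cross_attention(z, f_a, mask):
    """Hypothetical mask-gated audio cross-attention (not the paper's code).

    z    : (N, d) spatial latent tokens (queries)
    f_a  : (T, d) audio feature tokens (keys and values)
    mask : (N,)   motion-mask weight per spatial token, in [0, 1]
    """
    d = z.shape[-1]
    scores = z @ f_a.T / np.sqrt(d)   # (N, T) scaled dot-product scores
    attn = softmax(scores, axis=-1)   # each spatial token attends over audio tokens
    z_ca = attn @ f_a                 # (N, d) audio-conditioned update
    # Assumption: the mask gates the update, so unmasked tokens pass through unchanged.
    return z + mask[:, None] * z_ca

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 4))
f_a = rng.normal(size=(3, 4))
mask = np.array([1, 1, 0, 0, 1, 0], dtype=float)
out = masked_audio_cross_attention(z, f_a, mask)
```

Under this assumed design, tokens whose mask value is zero leave the layer unchanged, so audio influence is confined to the regions the mask marks as moving (for example lips or hands). A real implementation would add learned query, key, and value projections and apply the gating at several resolutions.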