MMoFusion: Multi-modal Co-Speech Motion Generation with Diffusion Model

Sen Wang; Jiangning Zhang; Xin Tan; Zhifeng Xie; Chengjie Wang; Lizhuang Ma

TL;DR

MMoFusion is a multi-modal co-speech motion generation framework built on a diffusion model to ensure both the authenticity and diversity of generated motion. Its progressive fusion strategy strengthens inter-modal and intra-modal interaction, integrating multi-modal information efficiently.

Abstract

The body movements accompanying speech help speakers express their ideas. Co-speech motion generation is an important approach for synthesizing realistic avatars. Because the correspondence between speech and motion is intricate, generating realistic and diverse motion is a challenging task. In this paper, we propose MMoFusion, a Multi-modal co-speech Motion generation framework based on the diffusion model to ensure both the authenticity and diversity of generated motion. We propose a progressive fusion strategy to enhance inter-modal and intra-modal interaction, efficiently integrating multi-modal information. Specifically, we employ a masked style matrix based on emotion and identity information to control the generation of different motion styles. Temporal modeling of speech and motion is partitioned into style-guided specific feature encoding and shared feature encoding, aiming to learn both inter-modal and intra-modal features. In addition, we propose a geometric loss to enforce coherence of the joints' velocity and acceleration across frames. Given input speech and editable identity and emotion, our framework generates vivid, diverse, and style-controllable motion of arbitrary length. Extensive experiments demonstrate that our method outperforms current co-speech motion generation methods on both upper-body and the more challenging full-body generation.
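The geometric loss mentioned in the abstract can be illustrated with a minimal sketch. The exact formulation is not given in this summary, so everything below is an assumption: velocity and acceleration are taken as first and second finite differences over frames, the error is a mean squared error, and the names `finite_diff`, `mean_sq_error`, `geometric_loss`, `w_vel`, and `w_acc` are hypothetical.

```python
from typing import List

Motion = List[List[float]]  # frames x joint features


def finite_diff(seq: Motion) -> Motion:
    """Frame-to-frame difference: out[i] = seq[i + 1] - seq[i]."""
    return [[b - a for a, b in zip(prev, nxt)] for prev, nxt in zip(seq, seq[1:])]


def mean_sq_error(a: Motion, b: Motion) -> float:
    """Mean squared error over all frames and features (0.0 if empty)."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for va, vb in zip(row_a, row_b):
            total += (va - vb) ** 2
            count += 1
    return total / count if count else 0.0


def geometric_loss(pred: Motion, target: Motion,
                   w_vel: float = 1.0, w_acc: float = 1.0) -> float:
    """Assumed form: weighted MSE on joint velocity and acceleration."""
    vel_p, vel_t = finite_diff(pred), finite_diff(target)
    acc_p, acc_t = finite_diff(vel_p), finite_diff(vel_t)
    return w_vel * mean_sq_error(vel_p, vel_t) + w_acc * mean_sq_error(acc_p, acc_t)


if __name__ == "__main__":
    target = [[0.0], [1.0], [2.0], [3.0]]
    jittery = [[0.0], [1.0], [3.0], [3.0]]
    print(geometric_loss(target, target))
    print(geometric_loss(jittery, target))
```

Under this assumed form, a prediction that differs from the target only by a constant offset incurs no penalty, since only frame-to-frame changes are compared; a position reconstruction term would be needed alongside it.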

Paper Structure (17 sections, 12 equations, 10 figures, 6 tables)

This paper contains 17 sections, 12 equations, 10 figures, 6 tables.

Introduction
Related Work
Human Motion Generation
Multi-Modal Learning
Method
Preliminary DDPM
Progressive Fusion
Sampling
Experiments
Datasets and Experimental Setting
Quantitative Results
Qualitative Results
Model Analysis
Conclusion
Limitations and Future Work
...and 2 more sections

Figures (10)

Figure 1: Our MMoFusion framework generates realistic, coherent, and diverse motions conditioned on speech, editable identities, and emotions. The top and bottom show motion results with different identities and emotions.
Figure 2: Comparison of our method with existing multi-modal motion generation methods. Early Fusion: CaMN [liu2022beat] uses simple concatenation, MDM [tevet2023human] utilizes a conditional token. Mid Fusion: DiffuseStyleGesture [yang2023diffusestylegesture] leverages cross-local attention to establish intermediate representations. We propose a Progressive Fusion Strategy to fully learn multi-modal features.
Figure 3: Overview of the MMoFusion framework. We propose a Progressive Fusion Strategy (PFS) that fuses multi-modal information in three stages. 1) Feature Processing. A noisy motion sequence $x_t$ at time step $t$ is fed into the diffusion model, conditioned on multi-modal information. Speech feature $\mathbf{s}$ is obtained by concatenating the transcript and audio features extracted from pre-trained models. We utilize a masked style matrix $\mathbf{m}_{s}$ to guide motion generation; it is mapped into a style token $\mathbf{z}_{s}$ that is used throughout the multi-modal fusion. 2) Specific Feature Encoding. Speech feature $\mathbf{s}$ and motion feature $\mathbf{x}$ are encoded separately to obtain the specific features $\mathbf{s}'$ and $\mathbf{x}'$. 3) Shared Feature Encoding. Shared feature $\mathbf{f}$ is obtained by fusing the specific features with cross-attention. Finally, the motion $\hat{x}_{0}$ is generated from the hybrid feature $\mathbf{f}'$, which is aggregated from the specific and shared features and guided by three different style tokens $\mathbf{z}_{s}^{i}$ and time tokens $\mathbf{z}_{t}^{i}$. Inference. At each time step $t$ of the diffusion model, we predict $\hat{x}_{0}$ with the denoising process based on the corresponding multi-modal conditions, then add noise to $\hat{x}_{0}$ with the diffusion process to obtain the sample at time step $t-1$.
Figure 4: We train MMoFusion on 10-second clips and generate motion of any length by interpolating between the tail of the previous motion clip and the head of the next motion clip. Overlapping clips in the batch share the same color.
Figure 5: Visual comparisons of upper body and full body motion generation results.
...and 5 more figures
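
The long-sequence generation described in the Figure 4 caption can be sketched as a cross-fade between consecutive clips. This is only an illustration under stated assumptions: the caption says the tail of one clip is interpolated with the head of the next, but the linear blending weights, the `overlap` length, and the function name `stitch_clips` are hypothetical choices and not taken from the paper.

```python
from typing import List

Clip = List[List[float]]  # frames x joint features


def stitch_clips(clips: List[Clip], overlap: int) -> Clip:
    """Concatenate clips, linearly blending `overlap` frames at each seam.

    Assumption: the last `overlap` frames of the running result and the
    first `overlap` frames of the next clip cover the same time span.
    """
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    if not clips:
        return []
    out = [list(frame) for frame in clips[0]]
    for clip in clips[1:]:
        if overlap > len(clip) or overlap > len(out):
            raise ValueError("overlap is longer than a clip")
        keep = len(out) - overlap
        tail, head = out[keep:], clip[:overlap]
        blended = []
        for k in range(overlap):
            # Weight moves from the previous clip toward the next clip.
            w = (k + 1) / (overlap + 1)
            blended.append([(1 - w) * t + w * h for t, h in zip(tail[k], head[k])])
        out = out[:keep] + blended + [list(f) for f in clip[overlap:]]
    return out


if __name__ == "__main__":
    first = [[0.0] for _ in range(4)]
    second = [[3.0] for _ in range(4)]
    for frame in stitch_clips([first, second], overlap=2):
        print(frame)
```

With an overlap of two frames, two four-frame clips yield six frames, and the seam frames move gradually from the first clip's values to the second's instead of jumping.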
