SpeechAct: Towards Generating Whole-body Motion from Speech

Jinsong Zhang, Minjie Zhu, Yuxiang Zhang, Yebin Liu, Kun Li

TL;DR

This work tackles generating natural and diverse whole-body motion from speech. It introduces SpeechAct, a framework that combines a hybrid point representation (SMPL-X surface points plus keypoints) with a three-codebook VQ-VAE motion space and a contrastive motion learning–driven translation model for body and hand motions, plus a deterministic face generator for lip-sync. On BEAT2, SpeechAct achieves superior realism and rhythm alignment (FID-k/g, BeatAlign), greater motion diversity (Div-in/Div-out), and reduced foot skating, with faster generation than diffusion-based methods. The approach also demonstrates cross-language generalization and practical avatar-animation applications, highlighting its potential for VR/AR and interactive HCI scenarios.

Abstract

This paper addresses the problem of generating whole-body motion from speech. Despite great successes, prior methods still struggle to produce reasonable and diverse whole-body motions from speech, owing to their reliance on suboptimal representations and a lack of strategies for generating diverse results. To address these challenges, we present a novel hybrid point representation that achieves accurate and continuous motion generation (e.g., avoiding foot skating) and can be transformed into an easy-to-use representation, i.e., the SMPL-X body mesh, for many applications. Facial motion is closely tied to the audio signal, so we introduce an encoder-decoder architecture that produces deterministic outcomes. The body and hands have weaker connections to the audio signal, so for them we aim to generate diverse yet reasonable motions. To boost diversity in motion generation, we propose a contrastive motion learning method that encourages the model to produce more distinctive representations. Specifically, we design a robust VQ-VAE to learn a quantized motion codebook using our hybrid representation. Then, we regress the motion representation from the audio signal with a translation model that employs our contrastive motion learning method. Experimental results validate the superior performance and the correctness of our model. The project page is available for research purposes at http://cic.tju.edu.cn/faculty/likun/projects/SpeechAct.
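
The quantization step that a VQ-VAE performs can be illustrated with a small sketch. This is not the authors' implementation: the function name `quantize`, the two-dimensional toy codebook, and the squared Euclidean distance are assumptions chosen for illustration, and the paper's encoder, codebook sizes, and three-codebook split are not reproduced here. The sketch only shows the generic idea of replacing each continuous motion feature with its nearest codebook entry.

```python
from typing import List, Sequence, Tuple


def quantize(
    features: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each feature vector to its nearest codebook entry.

    Hypothetical helper: distance is squared Euclidean, which is the
    common choice in VQ-VAE models but is an assumption here.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for f in features:
        # Pick the codebook index with the smallest squared distance to f.
        best = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(f, codebook[k])),
        )
        indices.append(best)
        quantized.append(list(codebook[best]))
    return indices, quantized


# Toy data (hypothetical): three codes, three encoder outputs.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
toy_features = [[0.1, -0.2], [3.0, 3.5], [0.9, 1.2]]
toy_indices, toy_quantized = quantize(toy_features, toy_codebook)
```

In a setup like the one the abstract describes, a translation model would predict such code indices from audio, and a decoder would map the selected codes back to motion.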

Paper Structure

This paper contains 20 sections, 5 equations, 12 figures, and 5 tables.

Figures (12)

  • Figure 1: Given an audio input, our model can generate natural and diverse human motion sequences. The two samples shown are uniformly sampled from the generated space. Each human body mesh shares its color with the speech text that drives the motion.
  • Figure 2: The detailed architecture of SpeechAct. To generate whole-body motion, our model includes a two-stage body generator that produces diverse motions for the body and hands and a face generator that outputs deterministic results. Specifically, our model includes: (a) a VQ-VAE based on our proposed hybrid point representation to learn a motion codebook, (b) a translation model with a contrastive motion learning method to generate diverse motion codes from the learned motion codebook, and (c) an encoder-decoder architecture to generate deterministic face motion. Red lines mark modules used only for training, and purple lines mark modules used for both training and inference.
  • Figure 3: The overview of our hybrid representation
  • Figure 4: Details of contrastive motion learning. We take the quantized features from the ground-truth motion as the positive sample, and the features generated from other audio clips as the negative samples. By pushing the current generated feature away from the negative samples, we obtain more distinctive representations.
  • Figure 5: Qualitative results compared with TalkShow [yi2023generating] and EMAGE [liu2023emage]. The left subfigure shows the continuity and smoothness of the generated motions, and the right subfigure presents the diversity of the results. In the left subfigure, the first row shows the two different audio inputs, the second row presents the related text, and the other rows show the results generated by different methods. Each sample consists of five frames extracted at intervals of 2/15 seconds from a generated motion clip. Lighter colors represent past frames.
  • ...and 7 more figures
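
The positive/negative setup described in the Figure 4 caption can be sketched with a generic InfoNCE-style loss. The paper's exact loss is not given in this section, so this is a stand-in rather than the authors' formulation: the function names, the cosine similarity, and the `temperature` value are all assumptions. The generated feature is the anchor, the quantized ground-truth feature is the positive, and features generated from other audio clips are the negatives.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def contrastive_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """InfoNCE-style loss: low when the anchor is near the positive
    and far from every negative."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy features (hypothetical values).
positive = [1.0, 0.1]                 # quantized ground-truth feature
negatives = [[0.0, 1.0], [-1.0, 0.0]]  # features generated from other audio
loss_good = contrastive_loss([1.0, 0.0], positive, negatives)
loss_bad = contrastive_loss([0.0, 1.0], positive, negatives)
```

Minimizing a loss of this shape moves the generated feature away from features produced for other audio, which matches the caption's stated goal of more distinctive representations.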