Enabling Synergistic Full-Body Control in Prompt-Based Co-Speech Motion Generation

Bohong Chen, Yumeng Li, Yao-Xiang Ding, Tianjia Shao, Kun Zhou

TL;DR

This work proposes SynTalker, which uses an off-the-shelf text-to-motion dataset as an auxiliary source for the full-body motion and prompts that speech-to-motion datasets lack, and learns an aligned embedding space of motion, speech, and prompts despite the significant distributional mismatch between speech-to-motion and text-to-motion datasets.

Abstract

Current co-speech motion generation approaches usually focus on upper-body gestures that follow the speech content only, and lack support for elaborate control of synergistic full-body motion based on text prompts, such as talking while walking. The major challenges are that 1) existing speech-to-motion datasets contain only highly limited full-body motions, leaving a wide range of common human activities outside the training distribution; and 2) these datasets also lack annotated user prompts. To address these challenges, we propose SynTalker, which uses an off-the-shelf text-to-motion dataset as an auxiliary source to supply the missing full-body motion and prompts. The core technical contributions are two-fold. The first is a multi-stage training process that obtains an aligned embedding space of motion, speech, and prompts despite the significant distributional mismatch in motion between speech-to-motion and text-to-motion datasets. The second is a diffusion-based conditional inference process that uses a separate-then-combine strategy to realize fine-grained control of local body parts. Extensive experiments verify that our approach supports precise and flexible control of synergistic full-body motion generation based on both speech and user prompts, which is beyond the ability of existing approaches.
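The separate-then-combine idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the denoiser signature, the `part_slices` layout, the condition strings, and the toy network are all assumptions made for illustration. The sketch only shows the general pattern of predicting noise once per body part under that part's own condition and then merging the per-part results into one full-body estimate.

```python
from typing import Callable, Dict, List

Vector = List[float]
# Hypothetical signature: (noisy latent, timestep, condition) -> predicted noise.
Denoiser = Callable[[Vector, int, str], Vector]


def separate_then_combine(
    x_t: Vector,
    t: int,
    denoiser: Denoiser,
    part_conditions: Dict[str, str],
    part_slices: Dict[str, slice],
) -> Vector:
    """Predict noise once per body part, then merge into one full-body estimate."""
    # Entries not covered by any slice stay 0.0.
    combined = [0.0] * len(x_t)
    for part, sl in part_slices.items():
        # Separate: run the full-body denoiser under this part's own condition.
        eps = denoiser(x_t, t, part_conditions[part])
        # Combine: keep only the entries that belong to this part.
        combined[sl] = eps[sl]
    return combined


def toy_denoiser(x: Vector, t: int, condition: str) -> Vector:
    """Stand-in for a trained network: scales the input by a per-condition factor."""
    scale = {"speech": 0.5, "prompt: walk forward": 2.0}[condition]
    return [scale * v for v in x]


if __name__ == "__main__":
    # Assumed layout: first two entries are the upper body, last two the lower body.
    slices = {"upper": slice(0, 2), "lower": slice(2, 4)}
    conditions = {"upper": "speech", "lower": "prompt: walk forward"}
    eps_hat = separate_then_combine(
        [1.0, 2.0, 3.0, 4.0], 10, toy_denoiser, conditions, slices
    )
    print(eps_hat)  # [0.5, 1.0, 6.0, 8.0]
```

In a real sampler the merged estimate would be passed to whatever update rule the diffusion scheduler uses at step `t`. Because every per-part prediction sees the full-body latent, each part can still react to the others, which is one plausible reading of how synergy is preserved.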

Paper Structure (32 sections, 9 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: The structure of our prompt-based co-speech generation model.
  • Figure 2: Illustration of the training and inference processes. We first train a contrastive embedding space between text and motion, alongside a motion auto-encoder that uses motion from both speech-to-motion and text-to-motion datasets to obtain an expressive latent space. Subsequently, our co-speech latent diffusion model is trained under the guidance of an implicit label extracted from motion using the contrastive space, which works around the lack of textual motion annotations in co-speech data. During inference, we apply a separate-then-combine strategy in every diffusion step, enabling finer control over individual body parts while preserving their synergistic interaction.
  • Figure 3: Qualitative results for synergistic full-body motion generation. More results are included in the appendix as well as demo videos.
  • Figure 4: Qualitative ablation studies on training and inference procedures. More results are included in the appendix.
  • Figure 5: The t-SNE visualizations of motions before and after RVQVAE (blue: motions in BEATX, red: motions in AMASS).
  • ...and 4 more figures
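
The Figure 2 caption describes a contrastive space between text and motion, and an implicit label extracted from motion for co-speech clips that have no text annotation. A minimal sketch of that pattern follows, assuming a CLIP-style symmetric InfoNCE loss; the caption does not state the exact loss, so the loss form, the temperature value, and the `implicit_label` helper are assumptions for illustration rather than the paper's method.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine_logits(a: List[Vector], b: List[Vector], temperature: float) -> List[Vector]:
    """Pairwise cosine similarities between rows of a and b, divided by temperature."""
    a_n = [normalize(v) for v in a]
    b_n = [normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) / temperature for w in b_n] for u in a_n]


def cross_entropy_diagonal(logits: List[Vector]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(
    text_emb: List[Vector], motion_emb: List[Vector], temperature: float = 0.07
) -> float:
    """Average of the text-to-motion and motion-to-text contrastive losses."""
    logits = cosine_logits(text_emb, motion_emb, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


def implicit_label(motion_embedding: Vector) -> Vector:
    """Hypothetical helper: use the motion's own embedding in the shared space
    as the conditioning signal when no text annotation exists."""
    return normalize(motion_embedding)


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]
    aligned = symmetric_info_nce(text, [[1.0, 0.0], [0.0, 1.0]])
    shuffled = symmetric_info_nce(text, [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < shuffled)  # True
```

Once text and motion share one space, an embedding computed from the motion itself can stand in for a text embedding at training time, which is how a diffusion model could be conditioned on co-speech data that has no prompts.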