T3M: Text Guided 3D Human Motion Synthesis from Speech

Wenshuo Peng, Kaipeng Zhang, Sai Qian Zhang

TL;DR

T3M tackles the limited controllability of speech-driven 3D motion by introducing a text-guided framework that conditions holistic body, hand, and facial motions on speech and textual prompts. It combines a VQ-VAE body-hand codebook, an EnCodec audio feature extractor, and a VideoCLIP-informed context feature module within a cross-attention–based multimodal fusion block, enabling diverse, controllable motion generation. Through training on the SHOW dataset and extensive ablations and user studies, T3M demonstrates improved realism and beat-synchrony compared with state-of-the-art approaches, while showcasing clear responsiveness to textual input. The approach offers practical impact for AI-driven animation and film production by enabling nuanced, user-guided motion synthesis from speech.

Abstract

Speech-driven 3D motion synthesis seeks to create lifelike animations based on human speech, with potential uses in virtual reality, gaming, and film production. Existing approaches rely solely on speech audio for motion generation, leading to inaccurate and inflexible synthesis results. To mitigate this problem, we introduce a novel text-guided 3D human motion synthesis method, termed T3M. Unlike traditional approaches, T3M allows precise control over motion synthesis via textual input, enhancing the degree of diversity and user customization. The experimental results demonstrate that T3M can greatly outperform the state-of-the-art methods in both quantitative metrics and qualitative evaluations. We have publicly released our code at https://github.com/Gloria2tt/T3M.git

Paper Structure (34 sections, 4 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Given the same audio input, extroverted and introverted people talk in completely different ways.
  • Figure 2: Overview of the proposed T3M. We employ a novel framework for body and hand motion generation. Specifically, T3M first learns a quantized body-hand codebook through a VQ-VAE model. In the training phase, we use the pre-trained EnCodec model to extract the speech embedding of the given speech, and the pre-trained video encoder from VideoCLIP to obtain the video embedding that corresponds to that speech. To facilitate interaction between these two modalities, we utilize a multimodal fusion block. This fusion block is built upon a BERT-based framework, enhanced with a cross-attention layer for effective fusion.
  • Figure 3: Visualization of 3D holistic motions generated by TalkSHOW and T3M. For T3M, three different text prompts are provided and the positions of the hands are highlighted with black boxes. We notice that the hand motions in T3M are closely aligned with the input text description.
  • Figure 4: Experiments on unseen speech. We use two different text inputs to control the motion generation.
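
The Figure 2 caption names two building blocks that a small example can make concrete: a codebook lookup of the kind a VQ-VAE performs, and a cross-attention step in which speech features attend to context features. The NumPy sketch below illustrates both generically. It is not the authors' implementation: the function names, tensor sizes, single-head attention, and random stand-in embeddings are all assumptions made for illustration, and the real model uses learned EnCodec and VideoCLIP features inside a BERT-based block.

```python
import numpy as np


def quantize(latents, codebook):
    """Nearest-neighbour codebook lookup, as in a generic VQ-VAE bottleneck.

    latents:  (T, D) continuous vectors (stand-ins for encoded motion frames)
    codebook: (K, D) code vectors
    """
    # Squared Euclidean distance between every latent and every code: (T, K).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)      # index of the closest code per frame
    return indices, codebook[indices]   # discrete tokens and quantized vectors


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: each query frame attends to context tokens."""
    q = queries @ w_q                            # (T, D)
    k = context @ w_k                            # (N, D)
    v = context @ w_v                            # (N, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (T, N) scaled dot products
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v, weights


# Hypothetical sizes: 8 speech frames, 4 context tokens, width 16, 32 codes.
T, N, D, K = 8, 4, 16, 32
rng = np.random.default_rng(0)

speech = rng.normal(size=(T, D))     # stand-in for speech embeddings
context = rng.normal(size=(N, D))    # stand-in for text/video context embeddings
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
fused, weights = cross_attention(speech, context, w_q, w_k, w_v)

codebook = rng.normal(size=(K, D))   # stand-in for a learned body-hand codebook
motion_latents = rng.normal(size=(T, D))
indices, quantized = quantize(motion_latents, codebook)
```

Here `fused` has one row per speech frame, each a context-weighted mixture, while `indices` is the sequence of discrete motion tokens. The two parts are deliberately left unconnected in the sketch, because the captions do not specify how the fused features are mapped to codebook entries.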