Semantics-Aware Human Motion Generation from Audio Instructions

Zi-An Wang, Shihao Zou, Shiyao Yu, Mingyuan Zhang, Chao Dong

TL;DR

This work addresses human motion generation from audio instructions by proposing an end-to-end masked generative transformer equipped with a memory-retrieval attention module. Audio features from WavLM are compressed into fixed-length conditioning signals that drive a two-stage RVQ-VAE-based motion generator, which produces base and residual motion codes and reconstructs sequences with a motion decoder. Augmenting existing text-motion datasets with conversational-style rewrites and multi-speaker audio yields the Original and Oral datasets, enabling robust evaluation of audio-conditioned generation. Experiments show that audio semantics can match text in guiding motion while providing substantial efficiency gains over cascaded approaches, and that models trained on the Oral dataset are more robust and follow instructions more closely under real-world audio conditions.

Abstract

Recent advances in interactive technologies have highlighted the prominence of audio signals for semantic encoding. This paper explores a new task in which audio signals are used as conditioning inputs to generate motions that align with the semantics of the audio. Compared with text-based interaction, audio provides a more natural and intuitive communication method. However, existing methods typically focus on matching motions with music or speech rhythms, which often results in a weak connection between the semantics of the audio and the generated motions. We propose an end-to-end framework using a masked generative transformer, enhanced by a memory-retrieval attention module to handle sparse and lengthy audio inputs. Additionally, we enrich existing datasets by converting descriptions into a conversational style and generating corresponding audio with varied speaker identities. Experiments demonstrate the effectiveness and efficiency of the proposed framework and show that audio instructions can convey semantics similar to text while providing more practical and user-friendly interactions.
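The abstract does not spell out the internals of the memory-retrieval attention module, so the following is only a minimal sketch of one common way to turn variable-length features into a fixed-length condition: a fixed set of memory slots acts as attention queries over the audio frames. The names `compress_audio` and `memory`, the slot count (8), and the feature size (16) are illustrative assumptions, not values from the paper, and the random arrays stand in for WavLM features and learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_audio(features, memory):
    """Map (T, d) audio features to (K, d), using K memory slots as queries.

    Assumption: plain single-head cross-attention without learned projections.
    """
    d = features.shape[-1]
    scores = memory @ features.T / np.sqrt(d)  # (K, T) slot-to-frame similarity
    weights = softmax(scores, axis=-1)         # each slot attends over all T frames
    return weights @ features                  # (K, d), independent of T

rng = np.random.default_rng(0)
memory = rng.standard_normal((8, 16))          # hypothetical: 8 slots, 16-dim features
short_clip = rng.standard_normal((50, 16))     # stand-in for a short audio feature sequence
long_clip = rng.standard_normal((700, 16))     # stand-in for a long audio feature sequence

print(compress_audio(short_clip, memory).shape)
print(compress_audio(long_clip, memory).shape)
```

Both calls return an array with the same shape as `memory`, which is what lets a downstream generator consume conditions of a consistent length regardless of the audio duration.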

Paper Structure

This paper contains 14 sections, 6 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Overview of Our Work. Given an audio instruction as the conditional signal (with text included for reference purposes only), our generative model is able to produce high-quality human motion sequences that accurately align with the semantics of the audio input.
  • Figure 2: Distribution of Audio Feature Lengths. WavLM [chen2022wavlm] is used to extract audio features from the augmented Oral Datasets derived from HumanML3D [guo2022generating] and KIT [plappert2016kit]. A statistical analysis of the feature lengths reveals significant variability, with some features being notably long. This variability makes the conditional signals harder to process and complicates the integration of audio data into subsequent stages of our framework.
  • Figure 3: Audio Conditions Processing Pipeline. The audio features extracted by WavLM [chen2022wavlm] are processed by a memory-retrieval-based module, which standardizes the varying lengths of the input audio conditions. This module converts all audio signals to a consistent length, allowing smooth integration with subsequent components in the pipeline. The processed features are then used as conditional signals for the generative model.
  • Figure 4: Overview of Our Generative Framework in Training and Inference. The framework consists of two key components. The Masked Transformer models the relationship between the audio conditions and the base motion codes, which capture the principal components of the motion. The Residual Transformer models the relationship between the audio conditions and the residual motion codes, which represent the finer details of the motion. During inference, these transformers run in sequential stages, progressively generating multi-layer latent motion codes that are then decoded to reconstruct the full motion sequence.
  • Figure 5: Overview of the Audio-Motion Dataset Augmentation Process. The texts from the existing datasets, KIT [plappert2016kit] and HumanML3D [guo2022generating], are fed into the text-to-speech model Tortoise [betker2023better], which generates audio signals with random speaker identities to create the Original Dataset. In addition, the large language model ChatGPT-3.5 [openai2024gpt] is used to rewrite the original texts in a more conversational, spoken-language style. These rewritten texts are then used to generate the corresponding audio signals, forming the Oral Dataset.
  • ...and 3 more figures
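
The split between base and residual motion codes described in the Figure 4 caption follows the general idea of residual vector quantization, which can be sketched as below. This is a generic illustration on a single vector, not the paper's RVQ-VAE: the function names, the number of layers (3), the codebook size (32), and the random codebooks are all assumptions made for the example.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize vector x (d,) layer by layer; each codebook has shape (n_codes, d).

    The first index plays the role of the base code, the remaining indices
    are residual codes. Returns the code indices and the leftover residual.
    """
    residual = x.copy()
    codes = []
    for cb in codebooks:
        # pick the codebook entry closest to what is still unexplained
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, residual

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from every layer."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((32, 4)) for _ in range(3)]  # hypothetical sizes
x = rng.standard_normal(4)

codes, residual = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
print(codes, np.linalg.norm(x - x_hat))
```

By construction, the decoded vector plus the leftover residual equals the input exactly, so each additional layer only has to model what the earlier layers missed. In the framework summarized above, the two transformers predict such code indices from the audio condition rather than obtaining them from an encoder at inference time.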