Semantics-Aware Human Motion Generation from Audio Instructions
Zi-An Wang, Shihao Zou, Shiyao Yu, Mingyuan Zhang, Chao Dong
TL;DR
This work tackles generating human motion from audio instructions by proposing an end-to-end masked generative transformer equipped with a memory-retrieval attention module. Audio features from WavLM are compressed into fixed-length conditioning signals that drive a two-stage RVQ-VAE–based motion generator, which produces base and residual motion codes and reconstructs sequences with a motion decoder. Augmenting existing text–motion datasets with conversational-style rewrites and multi-speaker audio yields two datasets, Original and Oral, enabling robust evaluation of audio-conditioned generation. Experiments show that audio semantics can match text in guiding motion while providing substantial efficiency gains over cascaded approaches, and that models trained on the Oral dataset are more robust and follow instructions more faithfully under real-world audio conditions.
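To make the "base and residual motion codes" concrete, here is a minimal NumPy sketch of residual vector quantization in general. It is not the authors' implementation: the codebooks are random rather than learned, and the function names `rvq_encode` and `rvq_decode`, the latent size, and the number of layers are all assumptions chosen for illustration.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize each row of x with a stack of codebooks (residual VQ).

    x: (T, D) array of latent frames.
    codebooks: list of (K, D) arrays. The first yields the base codes;
    each later one quantizes the residual left by the earlier layers.
    Returns integer codes of shape (num_layers, T).
    """
    residual = x.astype(np.float64)
    codes = []
    for cb in codebooks:
        # Squared Euclidean distance from every frame to every codebook entry.
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        # Pass on only what this layer failed to explain.
        residual = residual - cb[idx]
    return np.stack(codes)

def rvq_decode(codes, codebooks):
    """Sum the selected entries across layers to approximate the input."""
    return sum(cb[idx] for cb, idx in zip(codebooks, codes))

# Hypothetical sizes; random codebooks stand in for learned ones.
rng = np.random.default_rng(0)
T, D, K = 16, 8, 32
latents = rng.normal(size=(T, D))
codebooks = [rng.normal(size=(K, D)) * (0.5 ** i) for i in range(3)]

codes = rvq_encode(latents, codebooks)   # row 0: base codes, rows 1+: residual codes
recon = rvq_decode(codes, codebooks)     # approximation of `latents`
```

In a two-stage generator of the kind the summary describes, the first stage would predict the base codes (row 0) and the second would predict the remaining residual rows; how the paper splits this work is not specified here.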
Abstract
Recent advances in interactive technologies have highlighted the prominence of audio signals for semantic encoding. This paper explores a new task in which audio signals are used as conditioning inputs to generate motions that align with the semantics of the audio. Unlike text-based interactions, audio provides a more natural and intuitive communication method. However, existing methods typically focus on matching motions with music or speech rhythms, which often results in a weak connection between the semantics of the audio and the generated motions. We propose an end-to-end framework using a masked generative transformer, enhanced by a memory-retrieval attention module to handle sparse and lengthy audio inputs. Additionally, we enrich existing datasets by converting descriptions into conversational style and generating corresponding audio with varied speaker identities. Experiments demonstrate the effectiveness and efficiency of the proposed framework and show that audio instructions can convey semantics similar to text while providing more practical and user-friendly interactions.
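The abstract does not spell out how the memory-retrieval attention module works. One plausible reading, sketched below purely as an assumption, is that a small set of memory vectors acts as queries that cross-attend over the frame-level audio features, so that a clip of any length is summarized into the same number of conditioning vectors. The function name `compress_audio`, the single-head formulation, and all sizes are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_audio(audio_feats, memory, w_k, w_v):
    """Cross-attend M memory slots over T audio frames (single head).

    audio_feats: (T, D) frame-level features, e.g. from a speech encoder.
    memory: (M, D) query vectors; these would be learned in a real model.
    w_k, w_v: (D, D) key and value projections.
    Returns an (M, D) array whose shape does not depend on T.
    """
    keys = audio_feats @ w_k
    values = audio_feats @ w_v
    scores = memory @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)  # (M, T); each row sums to 1
    return weights @ values

# Hypothetical sizes; random weights stand in for trained parameters.
rng = np.random.default_rng(0)
D, M = 16, 4
memory = rng.normal(size=(M, D))
w_k = rng.normal(size=(D, D)) / np.sqrt(D)
w_v = rng.normal(size=(D, D)) / np.sqrt(D)

short_clip = rng.normal(size=(50, D))    # 50 audio frames
long_clip = rng.normal(size=(400, D))    # 400 audio frames

cond_short = compress_audio(short_clip, memory, w_k, w_v)
cond_long = compress_audio(long_clip, memory, w_k, w_v)
```

Both clips yield a conditioning array of the same shape, which illustrates why such a design would suit sparse and lengthy audio: the downstream generator sees a constant-size input however long the instruction is.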
