MOSPA: Human Motion Generation Driven by Spatial Audio
Shuyang Xu, Zhiyang Dou, Mingyi Shi, Liang Pan, Leo Ho, Jingbo Wang, Yuan Liu, Cheng Lin, Yuexin Ma, Wenping Wang, Taku Komura
TL;DR
This work addresses the gap in spatial-audio-driven motion generation by introducing the SAM dataset and a diffusion-based MOSPA model that conditions human motion on spatial audio features extracted from binaural signals. MOSPA uses MFCC/Tempogram/RMS-based audio features and SSL cues to guide a transformer-based diffusion process that outputs SMPL-X body motions with residual fusion to capture subtle audio–motion influences. Empirical results on SAM show state-of-the-art performance across metrics such as R-precision, FID, and APD, complemented by user studies confirming improved intent alignment and motion realism. The work highlights the importance of spatial cues in audio-driven animation and points to future improvements in physical realism, hand/face motions, and scene-aware generation.
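To make the RMS part of the feature set concrete, the following is a minimal sketch of frame-wise RMS energy computed separately for the left and right channels of a binaural signal. It is not the authors' pipeline: the function names `frame_rms` and `binaural_rms`, the frame and hop lengths, and the toy input are all assumptions chosen for illustration, and the MFCC, Tempogram, and SSL features are not covered.

```python
import math

def frame_rms(samples, frame_length=4, hop_length=2):
    """Root-mean-square energy of each full frame of a mono signal."""
    values = []
    # Slide a fixed-size window over the signal; partial trailing frames are dropped.
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frame = samples[start:start + frame_length]
        values.append(math.sqrt(sum(x * x for x in frame) / frame_length))
    return values

def binaural_rms(left, right, frame_length=4, hop_length=2):
    """Per-frame [left_rms, right_rms] pairs for a two-channel signal."""
    left_rms = frame_rms(left, frame_length, hop_length)
    right_rms = frame_rms(right, frame_length, hop_length)
    return [list(pair) for pair in zip(left_rms, right_rms)]

# Toy input (hypothetical): the same waveform, twice as loud in the left ear.
left = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
right = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
features = binaural_rms(left, right)
print(features)  # [[1.0, 0.5], [1.0, 0.5]]
```

Keeping the two channels separate, rather than averaging them to mono, preserves the level difference between the ears, which is one of the cues that distinguishes spatial audio from ordinary audio conditioning.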
Abstract
Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works focus on mapping modalities such as speech, audio, and music to human motion, and these models typically overlook the impact of spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA can generate diverse, realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task. Our code and model are publicly available at https://github.com/xsy27/Mospa-Acoustic-driven-Motion-Generation
