MOSPA: Human Motion Generation Driven by Spatial Audio

Shuyang Xu, Zhiyang Dou, Mingyi Shi, Liang Pan, Leo Ho, Jingbo Wang, Yuan Liu, Cheng Lin, Yuexin Ma, Wenping Wang, Taku Komura

TL;DR

This work addresses the gap in spatial-audio-driven motion generation by introducing the SAM dataset and MOSPA, a diffusion-based model that conditions human motion on features extracted from binaural spatial audio. MOSPA combines MFCC-, tempogram-, and RMS-based audio features with sound source location (SSL) cues to guide a transformer-based diffusion process that outputs SMPL-X body motion, using residual feature fusion to capture subtle influences of audio on motion. On SAM, MOSPA achieves state-of-the-art results on metrics such as R-precision, FID, and APD, and user studies confirm improved intent alignment and motion realism. The work highlights the importance of spatial cues in audio-driven animation and points to future improvements in physical realism, hand and face motion, and scene-aware generation.
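To make the RMS-based feature mentioned above concrete, the following is a minimal sketch of per-channel, frame-wise RMS energy for a two-channel (binaural) signal. The function name `frame_rms`, the frame and hop sizes, and the synthetic test tone are all assumptions chosen for illustration; the source does not specify how MOSPA computes its features, and the MFCC and tempogram features are not shown here.

```python
import numpy as np

def frame_rms(stereo, frame_length=2048, hop_length=512):
    """Frame-wise RMS energy per channel.

    stereo: array of shape (n_channels, n_samples)
    returns: array of shape (n_channels, n_frames)
    Frame and hop sizes are illustrative assumptions, not values from the paper.
    """
    n_channels, n_samples = stereo.shape
    n_frames = 1 + (n_samples - frame_length) // hop_length
    rms = np.empty((n_channels, n_frames))
    for i in range(n_frames):
        start = i * hop_length
        segment = stereo[:, start:start + frame_length]
        # Root of the mean squared amplitude within the frame, per channel.
        rms[:, i] = np.sqrt(np.mean(segment ** 2, axis=1))
    return rms

# Hypothetical input: a 440 Hz tone that is twice as loud in the left channel.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
stereo = np.stack([1.0 * tone, 0.5 * tone])

rms = frame_rms(stereo)
print(rms.shape)                 # (channels, frames)
print(rms[0].mean() / rms[1].mean())  # left/right energy ratio
```

Keeping the two channels separate, rather than mixing down to mono, preserves the level difference between the ears, which is one of the ways a binaural recording carries spatial information.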

Abstract

Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generate human motion. As of yet, these models typically overlook the impact of spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA can generate diverse, realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task. Our code and model are publicly available at https://github.com/xsy27/Mospa-Acoustic-driven-Motion-Generation

Paper Structure

This paper contains 20 sections, 6 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: We introduce a novel human motion generation task centered on spatial audio-driven human motion synthesis. Top row: We curate a novel Spatial Audio-Driven Human Motion (SAM) dataset, including diverse spatial audio signals and high-quality 3D human motion pairs. Bottom row: We develop a generative framework for human MOtion generation driven by SPatial Audio (MOSPA) to produce high-quality, responsive human motion driven by spatial audio. We note that the motion generation results are both realistic and responsive, effectively capturing both the spatial and semantic features of spatial audio inputs.
  • Figure 2: Visualization of samples from SAM with expected motions annotated. Red dots indicate the actor's trajectory, while the blue sphere represents the sound source. The SAM dataset ensures high diversity by encompassing a broad spectrum of audio types and varying sound source locations.
  • Figure 3: Spatial audio-driven human motion data collection setup.
  • Figure 4: Statistics of action duration in the dataset.
  • Figure 5: The framework of MOSPA. We perform diffusion-based motion generation given spatial audio inputs. Specifically, Gaussian noise is added to the clean motion sample $\mathbf{x}_0$, generating a noisy motion vector $\mathbf{x}_t$, modeled as $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$. An encoder transformer then predicts the clean motion from the noisy motion $\mathbf{x}_t$, guided by extracted audio features $\mathbf{a}$, sound source location (SSL) $\mathbf{s}$, motion genre $g$, and timestep $t$.
  • ...and 7 more figures
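
The forward noising step described in the Figure 5 caption can be sketched as follows. This is a minimal illustration that assumes the standard DDPM formulation, in which repeated application of the per-step Gaussian transition yields the closed form $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$. The linear noise schedule, the number of steps, the motion tensor shape, and the helper names `make_alpha_bar` and `q_sample` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t from q(x_t | x_0) in closed form; also return the noise used."""
    noise = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise
    return x_t, noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Hypothetical clean motion: 120 frames, 75 pose dimensions per frame.
x0 = rng.standard_normal((120, 75))
x_t, noise = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)

print(x_t.shape)
print(alpha_bar[0], alpha_bar[-1])  # near 1 at t=0, near 0 at the final step
```

In the framework shown in Figure 5, a denoiser would take $\mathbf{x}_t$ together with the conditions $\mathbf{a}$, $\mathbf{s}$, $g$, and $t$ and predict the clean motion; that network is not sketched here.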