Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models

Simon Alexanderson, Rajmund Nagy, Jonas Beskow, Gustav Eje Henter

TL;DR

The paper tackles the challenge of generating high-quality, audio-driven 3D human motion for gestures and dance using diffusion models. It introduces a Conformer-based diffusion architecture adapted from DiffWave, uses classifier-free guidance to adjust the strength of stylistic expression, and generalises that guidance procedure to product-of-expert ensembles of diffusion models for style interpolation and cross-model synthesis. Subjective and objective evaluations on gesture and dance datasets indicate top-of-the-line motion quality with distinctive styles whose expression can be made more or less pronounced, and the same architecture also synthesises path-driven locomotion. The work also provides a dataset release and outlines avenues for speedups and multimodal conditioning, underscoring the practical potential for controllable, probabilistic audio-driven animation.

Abstract

Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise path-driven locomotion using the same model architecture. Finally, we generalise the guidance procedure to obtain product-of-expert ensembles of diffusion models and demonstrate how these may be used for, e.g., style interpolation, a contribution we believe is of independent interest. See https://www.speech.kth.se/research/listen-denoise-action/ for video examples, data, and code.
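
The guidance idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes the standard classifier-free guidance rule applied to a denoiser's noise predictions, and the function names (`guided_noise`, `blended_noise`), the array shapes, and the weighted-sum form used for the multi-style case are hypothetical choices made for illustration only. Random arrays stand in for the outputs of a trained network.

```python
import numpy as np


def guided_noise(eps_uncond, eps_cond, gamma):
    """Standard classifier-free guidance on noise predictions.

    gamma = 0 gives the unconditional prediction, gamma = 1 the
    style-conditional one, and gamma > 1 exaggerates the style.
    """
    return eps_uncond + gamma * (eps_cond - eps_uncond)


def blended_noise(eps_uncond, eps_styles, weights):
    """Assumed generalisation to several styles: each style prediction
    contributes its offset from the unconditional prediction, scaled by
    its own weight. This is one common way to combine diffusion models
    as a product of experts; the paper's exact form may differ.
    """
    eps_styles = np.asarray(eps_styles)   # shape (K, T, D)
    weights = np.asarray(weights)         # shape (K,)
    offsets = eps_styles - eps_uncond     # broadcasts over K
    return eps_uncond + np.tensordot(weights, offsets, axes=1)


# Stand-in predictions for T = 4 frames of a D = 3 dimensional pose.
rng = np.random.default_rng(0)
eps_u = rng.normal(size=(4, 3))   # no style label given
eps_a = rng.normal(size=(4, 3))   # hypothetical style A
eps_b = rng.normal(size=(4, 3))   # hypothetical style B

stronger_a = guided_noise(eps_u, eps_a, gamma=1.5)
halfway = blended_noise(eps_u, [eps_a, eps_b], [0.5, 0.5])
```

With weights that sum to one, the unconditional term cancels and the blend reduces to a weighted average of the style predictions, which is one way to read "style interpolation" in this setting.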

Paper Structure (34 sections, 8 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Architecture diagrams. Each subfigure illustrates a component of the prior subfigure. Rectangular boxes are vectors and scalars, rounded boxes are neural networks or learnt operations, and ovals are fixed mathematical operations.
  • Figure 2: Screenshot of the user interface used for subjective evaluations.
  • Figure 3: Dance synthesised from our trained diffusion model in the Locking and Krumping styles. See the teaser figure for the Jazz style. Avatar © Motorica AB.
  • Figure 4: 3D stick-figure skeleton visualisation excerpted from dance-evaluation video.
  • Figure 5: Locomotion generated by our model, conditioned on a circular path and a style that always holds the left arm out. Avatar © Motorica AB.
  • ...and 3 more figures