Generative human motion mimicking through feature extraction in denoising diffusion settings

Alexander Okupnik, Johannes Schneider, Kyriakos Flouris

TL;DR

Problem: enabling embodied, interactive AI dance using single-person motion data. Approach: a diffusion-based pipeline (EDGE) augmented with motion inpainting and ILVR for style-conditioned, time-coherent imitation of a reference sequence while preserving improvisation. Contributions: a duet-free interaction framework, real-time-capable style-guided editing, and analysis of mimicry as a tunable follow strength that balances fidelity and diversity. Findings: longer ILVR refinement pulls generated motion closer to the reference and improves alignment while maintaining diversity within a practical operating range, demonstrated on the AIST++ dataset. Significance: advances embodied human–AI collaboration in dance, enabling expressive co-creation with an AI partner trained on solo motion data, with potential applications in performance and therapy.

Abstract

Recent success with large language models has sparked a new wave of verbal human-AI interaction. While such models support users in a variety of creative tasks, they lack the embodied nature of human interaction. Dance, as a primal form of human expression, is well suited to complement this experience. To explore creative human-AI interaction through dance, we build an interactive model based on motion capture (MoCap) data. It generates an artificial other by partially mimicking and also "creatively" enhancing an incoming sequence of movement data. It is the first model that leverages single-person motion data and high-level features to do so, and thus it does not rely on low-level human-human interaction data. It combines ideas from two diffusion models, motion inpainting, and motion style transfer to generate movement representations that are both temporally coherent and responsive to a chosen movement reference. We demonstrate the success of the model by quantitatively assessing how the feature distribution of the generated samples converges to that of the test set, which simulates the human performer. We show that our generations are a first step toward creative dancing with AI: they are diverse, deviating in various ways from the human partner, while still appearing realistic.

Paper Structure

This paper contains 21 sections, 10 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Human dances with AI in 3D. 3D motion data is captured (e.g. using a motion tracking suit) and decomposed into low- and high-frequency components. Low-frequency components are maintained to align AI and human movements and create a form of interactivity, while high-frequency components are sampled from a diffusion model combining multiple ideas to express diversity.
  • Figure 2: Depiction of the style-transfer process mimicking a reference sample $y_0$. The process starts with sampling $x_T$ from a normal distribution together with a noisy version of the reference sample $y_T$. In each iteration $t$, a low-pass filter $\phi_L$ is applied to $y_t$ and a high-pass filter $\phi_H$ to $x_t$. The sum is fed to the network $p$, which denoises it into the next iteration $x_{t-1}$.
  • Figure 3: DDIM: Non-Markovian sampling process, predicting the denoised sample conditioned on the previous sample and an estimation of the clean version $\hat{x}_0$.
  • Figure 4: Snapshots from the music and dance dataset AIST++ (Li et al., 2021).
  • Figure 5: Snapshots of samples from the test set (first row) compared to snapshots of the model with interaction strengths 40, 30, and 20 at the same time frame. One can observe that increasing the interaction strength leads to stronger mimicry, visible in the orientation of the body and the swinging of the arms.
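
The mixing step described in the caption of Figure 2, combined with the deterministic DDIM update of Figure 3, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the low-pass filter $\phi_L$ is assumed to be average pooling along time followed by repetition, $\phi_H$ is taken as its complement, `predict_x0` is a hypothetical stand-in for the trained denoising network, and `follow_steps` is a hypothetical knob standing in for the interaction strength.

```python
import numpy as np

def low_pass(x, factor):
    """Assumed phi_L: average-pool along time by `factor`, then repeat to full length."""
    frames, dims = x.shape
    assert frames % factor == 0, "frame count must be divisible by factor"
    pooled = x.reshape(frames // factor, factor, dims).mean(axis=1)
    return np.repeat(pooled, factor, axis=0)

def high_pass(x, factor):
    """Assumed phi_H: whatever the low-pass filter removes."""
    return x - low_pass(x, factor)

def mix(x_t, y_t, factor):
    """Low frequencies from the noisy reference y_t, high frequencies from x_t."""
    return low_pass(y_t, factor) + high_pass(x_t, factor)

def add_noise(y0, alpha_bar_t, rng):
    """Forward diffusion: noisy version y_t of the clean reference y0."""
    noise = rng.standard_normal(y0.shape)
    return np.sqrt(alpha_bar_t) * y0 + np.sqrt(1.0 - alpha_bar_t) * noise

def ddim_step(x_t, x0_hat, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0) from x_t and the clean estimate x0_hat."""
    eps_hat = (x_t - np.sqrt(alpha_bar_t) * x0_hat) / np.sqrt(1.0 - alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_hat + np.sqrt(1.0 - alpha_bar_prev) * eps_hat

def sample(y0, predict_x0, alpha_bars, factor, follow_steps, rng):
    """Start from Gaussian noise and denoise, following y0 for the first `follow_steps` steps."""
    x = rng.standard_normal(y0.shape)
    num_steps = len(alpha_bars)
    for i, t in enumerate(range(num_steps - 1, -1, -1)):
        if i < follow_steps:
            y_t = add_noise(y0, alpha_bars[t], rng)
            x = mix(x, y_t, factor)
        x0_hat = predict_x0(x, t)  # hypothetical network call
        alpha_bar_prev = alpha_bars[t - 1] if t > 0 else 1.0
        x = ddim_step(x, x0_hat, alpha_bars[t], alpha_bar_prev)
    return x

# Toy usage: 8 frames, 3 feature dimensions, and a dummy "network".
rng = np.random.default_rng(0)
reference = rng.standard_normal((8, 3))
alpha_bars = np.linspace(0.999, 0.01, 10)  # index 0 is the least noisy step
motion = sample(reference, lambda x, t: 0.5 * x, alpha_bars, 4, 5, rng)
```

With this choice of filter, the low-frequency content of the mixed sample equals that of the noisy reference exactly, which is what ties the generated motion to the human partner; applying the mixing for more steps would be expected to increase mimicry, in line with the trend described for Figure 5.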