Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model

Sangjune Park, Inhyeok Choi, Donghyeon Soon, Youngwoo Jeon, Kyungdon Joo

TL;DR

MambaDance is a dance generation approach built on a Mamba-based diffusion model, which is well suited to long, autoregressive sequences. Because musical beats play a critical role in choreography, it also introduces a Gaussian-based beat representation that explicitly guides the decoding of dance sequences.

Abstract

Dance is a form of human motion characterized by emotional expression and communication, playing a role in various fields such as music, virtual reality, and content creation. Existing methods for dance generation often fail to adequately capture the inherently sequential, rhythmical, and music-synchronized characteristics of dance. In this paper, we propose MambaDance, a new dance generation approach that leverages a Mamba-based diffusion model. Mamba, which is well suited to handling long and autoregressive sequences, is integrated into our two-stage diffusion architecture, replacing the off-the-shelf Transformer. Additionally, considering the critical role of musical beats in dance choreography, we propose a Gaussian-based beat representation to explicitly guide the decoding of dance sequences. Experiments on the AIST++ and FineDance datasets across sequence lengths show that, compared with previous methods, our method generates plausible dance movements that reflect these essential characteristics, consistently from short to long dances. Additional qualitative results and demo videos are available at https://vision3d-lab.github.io/mambadance.

Paper Structure (28 sections, 12 equations, 5 figures, 3 tables)

This paper contains 28 sections, 12 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: We propose MambaDance, a Mamba-based two-stage diffusion framework with an informative Gaussian beat representation. The result is coherent, beat-synchronized motion across variable lengths on AIST++ and FineDance.
  • Figure 2: The overall architecture of MambaDance. We extract a music feature $m$ and a novel beat representation $b$, the latter derived from the feature's binary beat mask (blue box). The two-stage diffusion architecture enables length-agnostic generation in a single inference pass (green box). The diffusion decoder consists of the proposed Mamba-based modules, including Single-Modal Mamba (SMM), Cross-Modal Mamba (CMM), and Adaptive Linear Modulation (AdaLM) (gray box).
  • Figure 3: Single-Modal Mamba (left) and Cross-Modal Mamba (right). In the input sequences to the Cross-Modal Mamba, light blue, dark blue, and pink blocks correspond to motion, condition, and timestep tokens, respectively.
  • Figure 4: Visualizations of the raw beat (a), Nearest Beat Distance (NBD) (b), and our Gaussian beat representation (c). The horizontal axis denotes the frame index (time step) within a sequence and the vertical axis indicates the signal value. The signal ranges of NBD and the proposed representation are $[0,11]$ and $[0, 1]$, respectively.
  • Figure 5: Qualitative comparison on the FineDance (top) and AIST++ (bottom) datasets. Each row shows a set of sampled frames captured at consistent intervals from the full sequence.
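
The three beat signals compared in Figure 4 can be illustrated with a short sketch. This is not the authors' implementation: this summary does not give the exact formula for the Gaussian beat representation, so the code below assumes one plausible form, a Gaussian of the distance to the nearest beat. The function names, the kernel width `sigma`, and the toy beat positions are all hypothetical.

```python
import numpy as np


def nearest_beat_distance(beat_mask):
    """Distance in frames from each frame to its nearest beat frame (NBD)."""
    beat_mask = np.asarray(beat_mask, dtype=bool)
    beat_idx = np.flatnonzero(beat_mask)
    if beat_idx.size == 0:
        raise ValueError("beat_mask contains no beats")
    frames = np.arange(beat_mask.size)
    # Pairwise |frame - beat| distances, reduced over the beat axis.
    return np.abs(frames[:, None] - beat_idx[None, :]).min(axis=1)


def gaussian_beat(beat_mask, sigma=2.0):
    """Assumed form: exp(-d^2 / (2 * sigma^2)), with d the nearest-beat distance.

    The value is exactly 1 on beat frames and decays smoothly toward 0
    between beats, so the signal stays within [0, 1].
    """
    d = nearest_beat_distance(beat_mask).astype(np.float64)
    return np.exp(-0.5 * (d / sigma) ** 2)


# Toy binary beat mask: 24 frames with beats at frames 3, 11, and 19.
mask = np.zeros(24, dtype=int)
mask[[3, 11, 19]] = 1

nbd = nearest_beat_distance(mask)
gauss = gaussian_beat(mask, sigma=2.0)
```

Under this assumption, the raw mask is nonzero only on beat frames, NBD grows linearly with distance from a beat and is unbounded in scale, and the Gaussian signal is bounded and peaks on the beat, which matches the ranges reported in the Figure 4 caption only qualitatively.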