InfiniteDance: Scalable 3D Dance Generation Towards in-the-wild Generalization

Ronghui Li, Zhongyuan Hu, Li Siyao, Youliang Zhang, Haozhe Xie, Mingyuan Zhang, Jie Guo, Xiu Li, Ziwei Liu

Abstract

Although existing 3D dance generation methods perform well in controlled scenarios, they often struggle to generalize in the wild. When conditioned on unseen music, existing methods often produce unstructured or physically implausible dance, largely due to limited music-to-dance data and restricted model capacity. This work aims to push the frontier of generalizable 3D dance generation by scaling up both data and model design. (1) On the data side, we develop a fully automated pipeline that reconstructs high-fidelity 3D dance motions from monocular videos. To eliminate the physical artifacts prevalent in existing reconstruction methods, we introduce a Foot Restoration Diffusion Model (FRDM) guided by foot-contact and geometric constraints that enforce physical plausibility while preserving kinematic smoothness and expressiveness, resulting in a diverse, high-quality multimodal 3D dance dataset totaling 100.69 hours. (2) On model design, we propose Choreographic LLaMA (ChoreoLLaMA), a scalable LLaMA-based architecture. To enhance robustness under unfamiliar music conditions, we integrate a retrieval-augmented generation (RAG) module that injects reference dance as a prompt. Additionally, we design a slow/fast-cadence Mixture-of-Experts (MoE) module that enables ChoreoLLaMA to smoothly adapt motion rhythms across varying music tempos. Extensive experiments across diverse dance genres show that our approach surpasses existing methods in both qualitative and quantitative evaluations, marking a step toward scalable, real-world 3D dance generation. Code, models, and data will be released.

Paper Structure (18 sections, 5 equations, 10 figures, 11 tables, 2 algorithms)

This paper contains 18 sections, 5 equations, 10 figures, 11 tables, 2 algorithms.

Figures (10)

  • Figure 1: Overview of our motion collection pipeline. Step 1: We estimate whole-body motion from monocular videos; these estimates contain artifacts. Step 2: We refine the motions through motion imitation in a physics simulator to obtain more physically plausible results, but this step often introduces frequent foot jittering. Step 3: We apply our Foot Restoration Diffusion Model (FRDM) to further correct foot motions. The final results show stable root and foot contacts without jittering or penetration artifacts.
  • Figure 2: (a) The Foot Restoration Diffusion Model (FRDM) can be trained in a self-supervised manner. We sample $\bm{x}_0$ from ground-truth motions and obtain $\bm{x}_t$ by adding noise. To repair only the artifacts in the root, knees, and feet, we replace these parts in $\bm{x}_0$ with the corresponding components from $\bm{x}_t$ to obtain $\acute{\bm{x}}_t$. We then train a foot denoising network $\bm{f_{\theta}}$ with $\hat{\bm{x}}_0=\bm{f_{\theta}}(\acute{\bm{x}}_t,t)$, $\bm{j}^p_0=\text{Cumsum}(\bm{j}^v_0)$, $\bm{j}^p_0=\text{FK}(\bm{j}^r_0)$. (b) Given the motion $\bm{x}$ with foot artifacts, we first sample $\bm{x}_T \sim \mathcal{N}(0,I)$ and obtain $\acute{\bm{x}}_t$ by replacing the root, knee, and foot regions of $\bm{x}$ with those of $\bm{x}_t$. In the early denoising steps $t>t_{th}$, where $t_{th}$ is a threshold, we apply geometric guidance to keep the restored motion geometrically consistent with the original input. In the last steps $t\leq t_{th}$, we use foot-contact guidance to explicitly improve foot stability.
  • Figure 3: (a) Our residual tokenizer maintains multi-layer codebooks. (b) Previous methods input discrete indices (e.g., "[$48_{(1)}$]") to LLaMA. (c) We project continuous quantized embeds $\bm{x}_q$ into $\bm{x}_e$ for LLaMA, preserving fine-grained features.
  • Figure 4: (a) Given a music clip and a target genre, we first use RAG to retrieve the top-k most relevant reference dances. These retrieved dances, together with the given music and genre embeds, are then fed into the Cadence-MoE to produce fused embeds. ChoreoLLaMA then autoregressively predicts dance tokens that are decoded into dance sequences. (b) Each "Expert" is a neural module composed of linear layers, multi-head attention, and a Mamba block. "RFFT" and "IRFFT" refer to the real-valued Fast Fourier Transform and its inverse, respectively. (c) At inference, ChoreoLLaMA predicts dance token indices $\hat{\bm{x}}_{idx}$ one by one; we then look up the quantized embeds $\hat{\bm{x}}_{q}$ in the codebooks and project them into dance embeds $\hat{\bm{x}}_{e}$.
  • Figure 5: Duration Distribution of InfiniteDance
  • ...and 5 more figures
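
The partial-noising step described in the Figure 2(a) caption can be illustrated with a minimal sketch. It is not the authors' implementation. The flat per-frame feature layout, the channel indices in `repair_idx`, the DDPM-style noising formula in `forward_noise`, and all function names are assumptions made for illustration; the caption only states that noise is added to $\bm{x}_0$, that the root, knee, and foot components are swapped in from $\bm{x}_t$, and that joint positions are obtained as a cumulative sum of joint velocities.

```python
import math
import random


def forward_noise(x0, alpha_bar_t, rng):
    """Noise a motion clip (assumed DDPM form, not confirmed by the source):
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, 1).
    """
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [[a * v + b * rng.gauss(0.0, 1.0) for v in frame] for frame in x0]


def partial_noise(x0, xt, repair_idx):
    """Build the partially noised input: keep the clean x_0 everywhere except
    the channels in repair_idx (hypothetical root/knee/foot channels), which
    are taken from the noised x_t.
    """
    repair = set(repair_idx)
    return [
        [xt[f][c] if c in repair else x0[f][c] for c in range(len(x0[f]))]
        for f in range(len(x0))
    ]


def cumsum(velocities, start=0.0):
    """Integrate per-frame velocities into positions by cumulative summation."""
    out = []
    total = start
    for v in velocities:
        total += v
        out.append(total)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    # Two frames, four feature channels; channels 0 and 3 stand in for the
    # root/knee/foot region (purely illustrative indices).
    x0 = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
    xt = forward_noise(x0, alpha_bar_t=0.5, rng=rng)
    x_acute = partial_noise(x0, xt, repair_idx=[0, 3])
    print(x_acute)
    print(cumsum([1.0, 2.0, 3.0]))
```

Because only the selected channels carry noise, a denoiser trained on such inputs sees the upper body as clean context and learns to restore just the lower-body region, which matches the self-supervised setup the caption describes.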