
Skeleton2Stage: Reward-Guided Fine-Tuning for Physically Plausible Dance Generation

Jidong Jia, Youjian Zhang, Huan Fu, Dacheng Tao

TL;DR

Experiments on multiple dance datasets consistently show that deriving physics-based rewards from the body mesh and applying Reinforcement Learning Fine-Tuning significantly improves the physical plausibility of generated motions, yielding more realistic and aesthetically pleasing dances.

Abstract

Despite advances in dance generation, most methods are trained in the skeletal domain and ignore mesh-level physical constraints. As a result, motions that look plausible as joint trajectories often exhibit body self-penetration and Foot-Ground Contact (FGC) anomalies when visualized with a human body mesh, reducing the aesthetic appeal of generated dances and limiting their real-world applications. We address this skeleton-to-mesh gap by deriving physics-based rewards from the body mesh and applying Reinforcement Learning Fine-Tuning (RLFT) to steer the diffusion model toward physically plausible motion synthesis under mesh visualization. Our reward design combines (i) an imitation reward that measures a motion's general plausibility by its imitability in a physical simulator (penalizing penetration and foot skating), and (ii) a Foot-Ground Deviation (FGD) reward with test-time FGD guidance to better capture the dynamic foot-ground interaction in dance. However, we find that the physics-based rewards tend to push the model toward freezing motions, since near-static motions exhibit fewer physical anomalies and are easier to imitate. To mitigate this, we propose an anti-freezing reward that preserves motion dynamics while maintaining physical plausibility. Experiments on multiple dance datasets consistently demonstrate that our method significantly improves the physical plausibility of generated motions, yielding more realistic and aesthetically pleasing dances. The project page is available at: https://jjd1123.github.io/Skeleton2Stage/
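
The trade-off described in the abstract, where physics terms alone favor a motionless dancer, can be illustrated with a toy reward. The sketch below is not the paper's reward: the function names, the weights `w_fg` and `w_af`, the `target_speed` threshold, and both formulas are hypothetical stand-ins, and the imitation reward is passed in as a plain number instead of being computed in a physics simulator.

```python
import numpy as np

def foot_ground_penalty(foot_heights, ground=0.0):
    """Toy foot-ground term (hypothetical): mean depth by which the lowest
    foot vertex sinks below the ground plane. foot_heights: shape (T, F)."""
    lowest = foot_heights.min(axis=1)
    return float(np.clip(ground - lowest, 0.0, None).mean())

def anti_freezing_reward(joints, fps=30.0, target_speed=0.5):
    """Toy anti-freezing term in [0, 1] (hypothetical): mean joint speed
    divided by an assumed target speed, capped at 1. joints: shape (T, J, 3)."""
    velocity = np.diff(joints, axis=0) * fps           # finite-difference velocity
    mean_speed = np.linalg.norm(velocity, axis=-1).mean()
    return float(min(mean_speed / target_speed, 1.0))

def total_reward(imitation_r, joints, foot_heights, w_fg=1.0, w_af=1.0):
    """Weighted sum of the three terms; the weights are illustrative only."""
    return (imitation_r
            - w_fg * foot_ground_penalty(foot_heights)
            + w_af * anti_freezing_reward(joints))

# Two synthetic motions with the same assumed imitation score.
T, J = 60, 24
frozen = np.zeros((T, J, 3))                           # no movement at all
phase = np.linspace(0.0, 2.0 * np.pi, T)[:, None, None]
moving = 0.3 * np.sin(phase) * np.ones((1, J, 3))      # smooth oscillation
feet_ok = np.full((T, 4), 0.01)                        # feet just above ground

r_frozen = total_reward(0.9, frozen, feet_ok)
r_moving = total_reward(0.9, moving, feet_ok)
print(r_frozen, r_moving)
```

With `w_af=0` the two motions score identically in this toy setup, which is the degenerate preference the anti-freezing term is meant to break.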

Paper Structure

This paper contains 20 sections, 7 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: The overview of our method. Our method formulates the denoising process as a multi-step Markov Decision Process, allowing diffusion models to be fine-tuned via RL. To incorporate physical constraints into diffusion models, we introduce physics-based rewards, including an imitation reward assessing the general physical plausibility with an imitation policy and an FGD reward to handle the dynamic nature of dance. Additionally, we design an anti-freezing reward to mitigate the physics-based rewards' preference for freezing motions.
  • Figure 2: Visual comparison of motions generated by EDGE (Tseng et al., 2023) and by our method. Both motion sequences are generated with the same music and seed. Some body parts are enlarged for a better view. Red boxes mark body penetration, while green boxes mark the improvement after RLFT. Subscripts denote frame numbers.
  • Figure 3: Visual comparison for the ablation studies. Each compared dance pair is generated from the same audio track. Left: results of PhysDiff compared with those of our method. As shown, motion projection can produce falling motions because the physically implausible movements cannot be accurately imitated. Right: an example result from the model trained without the anti-freezing reward, where the model tends to generate small-amplitude movements.
  • Figure 4: The performance of $\text{FID}_\text{k}$ and PFC during the fine-tuning process of EDGE.
  • Figure 5: Key differences: PhysDiff vs Ours.
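
The Figure 1 caption describes the denoising process as a multi-step Markov Decision Process. One common way to write this down, following DDPO-style formulations, is sketched below. The paper's own equations are not reproduced in this section, so the notation is an assumption: $c$ is the music condition, $x_t$ the noisy motion at denoising step $t$, $p_\theta$ the diffusion model, and $r$ the combined reward evaluated on the final motion $x_0$.

```latex
\begin{aligned}
s_t &= (c,\, t,\, x_t), \qquad a_t = x_{t-1}, \qquad
\pi_\theta(a_t \mid s_t) = p_\theta(x_{t-1} \mid x_t,\, c), \\
R(s_t, a_t) &=
\begin{cases}
r(x_0,\, c) & \text{if } t = 1, \\
0 & \text{otherwise},
\end{cases} \\
\nabla_\theta J(\theta) &= \mathbb{E}\!\left[\, \sum_{t=1}^{T}
\nabla_\theta \log p_\theta(x_{t-1} \mid x_t,\, c)\; r(x_0,\, c) \right].
\end{aligned}
```

Under this view each denoising step is an action with a tractable Gaussian log-likelihood, so a reward that is only available for the finished motion can still update every step through a policy-gradient estimator.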