DFM: Deep Fourier Mimic for Expressive Dance Motion Learning
Ryo Watanabe, Chenhao Li, Marco Hutter
TL;DR
DFM tackles the expressiveness gap in robot dancing by relaxing the local periodicity constraints of prior latent representations and introducing new motion encodings based on Fourier Latent Dynamics. By encoding reference dances into time-varying latent parameters $f_t$, $a_t$, $b_t$, and $\boldsymbol{\phi_t}$ and training with PPO-based reinforcement learning, DFM enables accurate tracking and smooth transitions between diverse motions while simultaneously performing auxiliary tasks such as locomotion and gaze control. Hardware experiments on an Aibo demonstrate improved tracking accuracy (mean absolute error, MAE, reduced from $0.132$ rad to $0.094$ rad on Aibo, and from $0.125$ rad to $0.103$ rad on an MIT humanoid), natural transitions via latent-space interpolation, and robust multi-task capabilities during dance. The approach matters for human-robot interaction with expressive entertainment robots: continuous frequency modulation and seamless cross-motion transitions enable dynamic, interactive performances rather than static motion replay.
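To make the latent parameters and the latent-space interpolation mentioned above more concrete, here is a minimal sketch. It is not the authors' implementation. It assumes that $f_t$, $a_t$, $b_t$, and $\boldsymbol{\phi_t}$ play the roles of frequency, amplitude, offset, and phase of a sinusoid per latent channel, as in common periodic-autoencoder formulations, and that a transition can be approximated by a linear blend of those parameters. The function names `decode_latent`, `interpolate_params`, and `mean_absolute_error` are hypothetical.

```python
import numpy as np


def decode_latent(f, a, b, phi, window):
    """Reconstruct one latent channel over a short time window.

    Assumed form (not confirmed by the source):
        z(tau) = a * sin(2*pi*(f*tau + phi)) + b
    """
    return a * np.sin(2.0 * np.pi * (f * window + phi)) + b


def interpolate_params(p0, p1, alpha):
    """Linearly blend two parameter tuples (f, a, b, phi).

    Simplification: the phase is blended linearly, ignoring wrap-around.
    """
    p0 = np.asarray(p0, dtype=float)
    p1 = np.asarray(p1, dtype=float)
    return (1.0 - alpha) * p0 + alpha * p1


def mean_absolute_error(q_ref, q):
    """Mean absolute joint-angle error in radians."""
    return float(np.mean(np.abs(np.asarray(q_ref) - np.asarray(q))))


# Illustrative values only; they are not taken from the paper.
window = np.linspace(-0.5, 0.5, 11)      # time offsets in seconds
motion_a = (1.0, 0.5, 0.0, 0.0)          # (f, a, b, phi) for one motion
motion_b = (2.0, 0.2, 0.1, 0.25)         # (f, a, b, phi) for another motion

blend = interpolate_params(motion_a, motion_b, alpha=0.5)
z_mid = decode_latent(*blend, window)    # latent signal halfway through the blend
```

Sweeping `alpha` from 0 to 1 over several control steps would give a gradual change of frequency, amplitude, offset, and phase rather than an abrupt switch between the two motions. The MAE helper shows how a tracking error in radians, like the figures quoted above, is computed from reference and measured joint angles.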
Abstract
As entertainment robots gain popularity, the demand for natural and expressive motion, particularly in dancing, continues to rise. Traditionally, dancing motions have been manually designed by artists, a process that is both labor-intensive and restricted to simple motion playback, lacking the flexibility to incorporate additional tasks such as locomotion or gaze control during dancing. To overcome these challenges, we introduce Deep Fourier Mimic (DFM), a novel method that combines advanced motion representation with Reinforcement Learning (RL) to enable smooth transitions between motions while concurrently managing auxiliary tasks during dance sequences. While previous frequency-domain-based motion representations have successfully encoded dance motions into latent parameters, they often impose overly rigid periodic assumptions at the local level, resulting in reduced tracking accuracy and motion expressiveness, a critical aspect for entertainment robots. By relaxing these locally periodic constraints, our approach not only enhances tracking precision but also facilitates smooth transitions between different motions. Furthermore, the learned RL policy supports simultaneous base activities, such as locomotion and gaze control, allowing entertainment robots to engage more dynamically and interactively with users rather than merely replaying static, pre-designed dance routines.
