DFM: Deep Fourier Mimic for Expressive Dance Motion Learning

Ryo Watanabe, Chenhao Li, Marco Hutter

TL;DR

DFM tackles the expressiveness gap in robot dancing by relaxing the local periodic constraints of prior latent representations and introducing new motion encodings built on Fourier Latent Dynamics (FLD). By encoding reference dances into dynamic latent parameters $f_t$, $a_t$, $b_t$, and $\boldsymbol{\phi_t}$ and employing PPO-based reinforcement learning, DFM enables accurate tracking and smooth transitions between diverse motions while simultaneously performing auxiliary tasks like locomotion and gaze control. Hardware experiments on aibo demonstrate improved tracking accuracy (MAE reductions from $0.132$ rad to $0.094$ rad on aibo, and from $0.125$ rad to $0.103$ rad on an MIT humanoid) and show natural transitions via latent-space interpolation, as well as robust multi-task capabilities during dance. The approach has practical impact for human-robot interaction with expressive entertainment robots: it enables dynamic, interactive performances rather than static motion replay, with continuous frequency modulation and seamless cross-motion transitions.
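
To make the latent-space interpolation idea concrete, here is a minimal sketch, not the authors' implementation. It assumes each latent channel is decoded as a sinusoid $a \sin(2\pi(f\tau + \phi)) + b$, the form commonly used in periodic-autoencoder-style representations, with phase measured in cycles. The function names (`reconstruct`, `blend`, `advance_phase`) and all parameter values are hypothetical and chosen only for illustration.

```python
import math

def reconstruct(latent, window):
    """Decode each channel as a * sin(2*pi*(f*tau + phi)) + b (assumed form)."""
    rows = []
    for f, a, b, phi in zip(latent["f"], latent["a"], latent["b"], latent["phi"]):
        rows.append([a * math.sin(2.0 * math.pi * (f * tau + phi)) + b for tau in window])
    return rows

def blend(src, dst, alpha, phi):
    """Linearly interpolate frequency, amplitude and offset; carry the phase over."""
    out = {
        key: [(1.0 - alpha) * s + alpha * d for s, d in zip(src[key], dst[key])]
        for key in ("f", "a", "b")
    }
    out["phi"] = list(phi)
    return out

def advance_phase(latent, dt):
    """Propagate the phase (in cycles) by the current frequency, wrapped to [0, 1)."""
    return [(p + f * dt) % 1.0 for p, f in zip(latent["phi"], latent["f"])]

# Hypothetical two-channel latent parameters for two motions.
motion_a = {"f": [1.0, 2.0], "a": [0.5, 0.2], "b": [0.0, 0.1], "phi": [0.0, 0.25]}
motion_b = {"f": [1.5, 1.0], "a": [0.3, 0.4], "b": [0.2, 0.0], "phi": [0.0, 0.0]}

dt, steps = 0.02, 50
phi = motion_a["phi"]
trajectory = []
for k in range(steps + 1):
    alpha = k / steps                      # ramp from motion A (0) to motion B (1)
    mixed = blend(motion_a, motion_b, alpha, phi)
    # Keep only the sample at tau = 0, i.e. the current time step.
    trajectory.append([row[0] for row in reconstruct(mixed, [0.0])])
    phi = advance_phase(mixed, dt)
```

Because the phase is propagated with the blended frequency instead of being interpolated directly, the decoded signal stays continuous while the frequency changes during the transition. In a full system, a learned decoder and the RL policy would take the place of the closed-form sinusoid used here.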

Abstract

As entertainment robots gain popularity, the demand for natural and expressive motion, particularly in dancing, continues to rise. Traditionally, dancing motions have been manually designed by artists, a process that is both labor-intensive and restricted to simple motion playback, lacking the flexibility to incorporate additional tasks such as locomotion or gaze control during dancing. To overcome these challenges, we introduce Deep Fourier Mimic (DFM), a novel method that combines advanced motion representation with Reinforcement Learning (RL) to enable smooth transitions between motions while concurrently managing auxiliary tasks during dance sequences. While previous frequency-domain-based motion representations have successfully encoded dance motions into latent parameters, they often impose overly rigid periodic assumptions at the local level, reducing tracking accuracy and motion expressiveness, both of which are critical for entertainment robots. By relaxing these locally periodic constraints, our approach not only enhances tracking precision but also facilitates smooth transitions between different motions. Furthermore, the learned RL policy supports simultaneous base activities, such as locomotion and gaze control, allowing entertainment robots to engage more dynamically and interactively with users rather than merely replaying static, pre-designed dance routines.

Paper Structure

This paper contains 14 sections, 4 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Deep Fourier Mimic (DFM) allows entertainment robots such as aibo to seamlessly combine artistic motions crafted by designers with auxiliary tasks like locomotion or gaze towards a human face, resulting in expressive motion that can smoothly transition between different movements at arbitrary timings. Project webpage: https://sony.github.io/DFM/
  • Figure 2: The expressive dance motion learning system is composed of four key components: motion design, motion representation, motion learning, and hardware inference. In the motion design phase, artists create motion references using specialized design software. The representation of these diverse motions is then learned using a PAE. Reinforcement learning (RL) is employed to enable the robot to perform auxiliary tasks, such as walking and head orientation control, while accurately tracking the designed dance references. During inference, the learned policy is deployed on the actual hardware, allowing real-time execution of dance motions together with dynamic, interactive behavior driven by the auxiliary task commands.
  • Figure 3: Height reached by the rear right leg. The left, middle, and right panels depict the reference motion, the FLD motion, and the DFM motion, respectively. The red dashed line marks the height of the rear right leg in the reference motion.
  • Figure 4: Comparison of tracking accuracy for FLD and DFM. Blue: reference motion created by the motion designer. Orange: motion reconstructed by the motion representation, conditioned on the reference motion. Green: joint encoder readings under the RL policy.
  • Figure 5: Comparison of the 8-channel latent parameters for FLD (left) and DFM (right), conditioned on the same dancing motion as in Fig. 3. The upper and lower plots show $\sin{\phi}$ and frequency, respectively.
  • ...and 5 more figures