HandelBot: Real-World Piano Playing via Fast Adaptation of Dexterous Robot Policies

Amber Xie; Haozhi Qi; Dorsa Sadigh

Abstract

Mastering dexterous manipulation with multi-fingered hands has been a grand challenge in robotics for decades. Despite its potential, the difficulty of collecting high-quality data remains a primary bottleneck for high-precision tasks. While reinforcement learning and simulation-to-real-world transfer offer a promising alternative, the transferred policies often fail for tasks demanding millimeter-scale precision, such as bimanual piano playing. In this work, we introduce HandelBot, a framework that combines a simulation policy and rapid adaptation through a two-stage pipeline. Starting from a simulation-trained policy, we first apply a structured refinement stage to correct spatial alignments by adjusting lateral finger joints based on physical rollouts. Next, we use residual reinforcement learning to autonomously learn fine-grained corrective actions. Through extensive hardware experiments across five recognized songs, we demonstrate that HandelBot can successfully perform precise bimanual piano playing. Our system outperforms direct simulation deployment by a factor of 1.8x and requires only 30 minutes of physical interaction data.

Paper Structure (30 sections, 3 equations, 5 figures, 4 tables)

INTRODUCTION
RELATED WORK
Real-World Piano Playing
Problem Statement
Reinforcement Learning in Simulation
Reward Design
Observations and Actions
Policy Refinement
Lateral Joint Correction
Iterative Updates
Chunked Updates
Real-World Residual Reinforcement Learning
Residual Policy Formulation
Residual RL Objective
Guided Noise
...and 15 more sections

Figures (5)

Figure 1: HandelBot Method. (0) RL in Sim. We leverage fast, parallel simulators for reinforcement learning. This yields a coarse base policy, $\pi_{sim}$, from which we extract an open-loop rollout, $\tau_{sim}$. (1) Policy Refinement. We refine $\tau_{sim}$, yielding $\tau^*_{sim}$. We use real-world rollouts to iteratively update the lateral joints of the fingers, moving each finger horizontally toward the keys it is intended to press. (2) Residual RL. We perform residual RL atop $\tau_{sim}$, using the keyboard's MIDI output as a reward. This allows us to further update our policy for better piano playing.
Figure 2: Hardware Setup. We use a MIDI keyboard, two Tesollo DG-5F hands, and two Franka arms for piano playing. We use the MIDI output from the piano, which tells us which notes are pressed, in order to calculate rewards. We emphasize that the robot hands are far larger than the average human hand, thus making piano playing difficult. Finally, for RL training, we include a collision checker which prevents fingers from pressing down beyond the keys.
Figure 3: Main Results. We report F1 score, multiplied by 100, for 5 songs. HandelBot consistently achieves the strongest F1 score, showing the importance of effectively using real-world samples to accomplish precise, dexterous piano playing. Methods using only simulated data, such as $\pi_{sim}$ (CL) and $\pi_{sim}$, perform poorly due to the sim-to-real gap.
Figure 4: Visualization of HandelBot Trajectories. For each song, we visualize the notes pressed correctly, pressed incorrectly, and missed. The x axis is the timestep of the song, and the y axis shows the different notes, with the top half representing keys for the right hand and the bottom half keys for the left hand. On easier songs such as Twinkle Twinkle and Ode to Joy, HandelBot makes few mistakes, with occasional timing errors or wrong presses. On harder songs such as Fur Elise, large jumps in the left-hand notes (bottom section of each song plot) are challenging for the left hand.
Figure 5: HandelBot Trajectories across Residual RL Training. We include 4 evaluation trajectories during HandelBot training, with the final, best-performing trajectory shown in Figure 4. Across these 4 trajectories, we see that HandelBot initially struggles with many keys in the left hand. However, with real-world interactions, the residual policy is able to adapt to the real world and press the correct keys.
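
The captions above report an F1 score (multiplied by 100) computed from the notes the keyboard registers over MIDI. The exact definition is not given in this section, so the following is only a minimal sketch under stated assumptions: it assumes a micro-averaged F1 over (timestep, key) pairs, and the function name `f1_score` and the set-per-timestep representation are hypothetical choices made for illustration.

```python
from typing import List, Set


def f1_score(target: List[Set[int]], pressed: List[Set[int]]) -> float:
    """Micro-averaged F1 over (timestep, key) pairs, scaled to 0-100.

    Assumption: each list entry is the set of MIDI note numbers that
    should be (target) or actually are (pressed) active at that timestep.
    """
    if len(target) != len(pressed):
        raise ValueError("target and pressed must cover the same timesteps")

    tp = fp = fn = 0
    for want, got in zip(target, pressed):
        tp += len(want & got)   # correctly pressed keys
        fp += len(got - want)   # keys pressed by mistake
        fn += len(want - got)   # keys that were missed

    if tp == 0:
        # No correct presses; by convention here the score is 0.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Toy example: one wrong extra key at t=1 and one missed key at t=3.
target = [{60}, {60}, {67}, {67}]
pressed = [{60}, {60, 62}, {67}, set()]
score = f1_score(target, pressed)  # precision = recall = 0.75
```

In this toy rollout there are three correct presses, one wrong press, and one missed note, so precision and recall are both 0.75. The three categories counted here correspond to the "pressed correctly", "pressed incorrectly", and "missed" notes visualized in Figure 4.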
