Dexterous Robotic Piano Playing at Scale

Le Chen, Yi Zhao, Jan Schneider, Quankai Gao, Simon Guist, Cheng Qian, Juho Kannala, Bernhard Schölkopf, Joni Pajarinen, Dieter Büchler

TL;DR

Dexterous robotic piano playing is hard due to high-dimensional, contact-rich dynamics. The paper introduces OmniPianist, a Flow Matching Transformer trained on RP1M++ with an Optimal Transport–based fingering mechanism that removes the need for human demonstrations. By combining large-scale specialist RL data and a flow-based imitation learner, OmniPianist achieves multi-piece proficiency and cross-embodiment generalization, reaching competitive performance on hundreds of pieces and strong zero-shot generalization. This work advances scalable, generalist dexterous manipulation and lays groundwork for real-robot deployment and richer musical expression.

Abstract

Endowing robot hands with human-level dexterity has been a long-standing goal in robotics. Bimanual robotic piano playing represents a particularly challenging task: it is high-dimensional, contact-rich, and requires fast, precise control. We present OmniPianist, the first agent capable of performing nearly one thousand music pieces via scalable, human-demonstration-free learning. Our approach is built on three core components. First, we introduce an automatic fingering strategy based on Optimal Transport (OT), allowing the agent to autonomously discover efficient piano-playing strategies from scratch without demonstrations. Second, we conduct large-scale Reinforcement Learning (RL) by training more than 2,000 agents, each specialized in distinct music pieces, and aggregate their experience into a dataset named RP1M++, consisting of over one million trajectories for robotic piano playing. Finally, we employ a Flow Matching Transformer to leverage RP1M++ through large-scale imitation learning, resulting in the OmniPianist agent capable of performing a wide range of musical pieces. Extensive experiments and ablation studies highlight the effectiveness and scalability of our approach, advancing dexterous robotic piano playing at scale.
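To make the fingering idea concrete, here is a minimal sketch of matching fingers to target keys by minimizing total travel distance. It is an illustration under stated assumptions, not the paper's formulation: the cost (1-D absolute distance along the keyboard), the one-finger-per-key restriction, and the brute-force solver are all assumptions, and the function name `assign_fingers` is hypothetical. With one finger per key the problem is a linear assignment problem, a simple discrete relative of Optimal Transport.

```python
from itertools import permutations


def assign_fingers(fingertip_x, key_x):
    """Assign a distinct finger to every target key, minimizing total distance.

    fingertip_x: x-positions of the fingertips along the keyboard (assumed 1-D).
    key_x: x-positions of the keys that must be pressed at this time step.
    Returns (total_cost, {key_index: finger_index}).

    Brute force over all injective key -> finger mappings; this is only
    practical for the small sizes involved here (at most ten fingers).
    """
    n_fingers, n_keys = len(fingertip_x), len(key_x)
    if n_keys > n_fingers:
        raise ValueError("more target keys than available fingers")

    best_cost, best_fingers = float("inf"), None
    for fingers in permutations(range(n_fingers), n_keys):
        # fingers[k] is the finger proposed for key k in this candidate.
        cost = sum(abs(fingertip_x[f] - key_x[k]) for k, f in enumerate(fingers))
        if cost < best_cost:
            best_cost, best_fingers = cost, fingers
    return best_cost, dict(enumerate(best_fingers))


if __name__ == "__main__":
    # Two fingers, two keys. Picking the nearest finger for key 0 first
    # (finger 1) would force finger 0 onto the far key; the joint
    # optimum avoids that.
    cost, mapping = assign_fingers([0.0, 3.0], [2.0, 3.5])
    print(cost, mapping)
```

The point of solving the matching jointly is visible in the demo: choosing the nearest finger key by key can leave a distant finger for the last key, whereas the joint optimum trades a slightly longer first move for a much shorter second one.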

Paper Structure

This paper contains 30 sections, 14 equations, 9 figures, 9 tables, 2 algorithms.

Figures (9)

  • Figure 1: Overview. This paper proposes an RL-based agent that alleviates the requirement for human-annotated fingering by formulating finger placement as an Optimal Transport (OT) problem, enabling the agent to play music pieces without human demonstrations, even the challenging "Flight of the Bumblebee". We then collect a large-scale dataset named RP1M++, consisting of more than 1 million expert piano-playing trajectories gathered by training more than 2,000 RL agents. The collected data is consumed via imitation learning by a multi-task agent, named OmniPianist, that is capable of playing hundreds of musical pieces.
  • Figure 2: Statistics of our RP1M / RP1M++ dataset. (Top) Histogram of pressed keys in our dataset. (Bottom Left) Distribution of the number of active keys over all time steps. (Bottom Right) Distribution of F1 scores of RL agents used to collect the dataset.
  • Figure 3: Flow Matching Transformer architecture. The transformer is conditioned on the goal extracted from MIDI files, robot hand fingertip positions, robot hand states, and piano states. Noised action tokens are linearly embedded, combined with learned positional embeddings, and passed through an $N$-block stack of Transformer decoder layers ($N=12$ in our case). The flow matching timestep is encoded using a sinusoidal embedding and concatenated with linearly projected observation tokens to form the conditioning sequence. This sequence is processed by a lightweight per-token MLP and serves as the cross-attention memory for the decoder. Finally, a layer norm and linear head map the token features to continuous actions.
  • Figure 4: Comparison of the RL performance with our OT fingering, human-annotated fingering, and no fingering. Our method matches the performance of RoboPianist-RL, which is trained with human fingering. Our method also outperforms the baseline without any fingering information by a large margin. The plots show the mean over 3 random seeds, and the shaded areas represent the 95% confidence interval.
  • Figure 5: Comparison of fingering discovered by the agent itself and human annotations. We visualize a sample trajectory of playing the French Suite No.5 Sarabande, along with the corresponding fingering. The agent discovers a fingering strategy that differs from human annotations, adapting to hardware constraints while accurately pressing the target keys.
  • ...and 4 more figures
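
The flow matching objective behind the architecture in Figure 3 can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: it assumes a straight-line path from noise (t = 0) to the expert action (t = 1), which is a common choice but is not confirmed by the captions above, and it leaves out the observation conditioning (MIDI goal, fingertip positions, hand and piano states). All function names are hypothetical, and `predict_velocity` stands in for the Transformer.

```python
def interpolate(noise, action, t):
    """Point on the assumed straight path from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Regression target: the constant velocity along that straight path."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(predict_velocity, noise, action, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(noise, action, t)      # the "noised action" input
    v_hat = predict_velocity(x_t, t)         # model output (stand-in callable)
    v = target_velocity(noise, action)
    return sum((p - q) ** 2 for p, q in zip(v_hat, v)) / len(v)


def euler_sample(predict_velocity, noise, num_steps=10):
    """Generate an action by integrating the velocity field from t=0 to t=1."""
    x, dt = list(noise), 1.0 / num_steps
    for i in range(num_steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    noise, action = [0.5, -1.0, 2.0], [0.1, 0.2, -0.3]

    # An oracle that already knows the target velocity: its loss is zero
    # and Euler integration carries the noise sample onto the action.
    def oracle(x, t):
        return target_velocity(noise, action)

    print(flow_matching_loss(oracle, noise, action, t=0.3))
    print(euler_sample(oracle, noise, num_steps=8))
```

In training, `t` and `noise` would be sampled afresh for every expert action and the loss averaged over a batch. At inference only `euler_sample` is needed, with the learned network in place of the oracle.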