PianoMime: Learning a Generalist, Dexterous Piano Player from Internet Demonstrations

Cheng Qian; Julen Urain; Kevin Zakka; Jan Peters

TL;DR

PianoMime tackles the problem of learning a dexterous, generalist piano-playing agent from Internet demonstrations. It presents a three-phase approach: data preparation to extract human fingertip trajectories and MIDI piano states from videos, policy learning to train song-specific experts with residual RL and style rewards, and policy distillation to a single generalist via behavioral cloning with representation learning and hierarchical design. The method is evaluated on a RoboPianist-like setup using 60 songs over 431 clips, with unseen clips used to assess generalization, and shows improvements over a robotic piano baseline, reporting up to $0.94$ F1 on some song-specific tests and substantial gains from distillation and architectural choices. The authors release the dataset and trained models as benchmarks, highlighting the viability of leveraging Internet data for dexterous manipulation and the potential for scalable generalist policies in musically rich tasks.

Abstract

In this work, we introduce PianoMime, a framework for training a piano-playing agent using internet demonstrations. The internet is a promising source of large-scale demonstrations for training robot agents. In particular, for piano playing, YouTube is full of videos of professional pianists playing a wide variety of songs. We leverage these demonstrations to learn a generalist piano-playing agent capable of playing arbitrary songs. Our framework is divided into three parts: a data preparation phase that extracts informative features from the YouTube videos, a policy learning phase that trains song-specific expert policies from the demonstrations, and a policy distillation phase that distills the policies into a single generalist agent. We explore different policy designs to represent the agent and evaluate how the amount of training data influences the agent's ability to generalize to novel songs not available in the dataset. We show that we are able to learn a policy with up to 56\% F1 score on unseen songs.
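Since the headline results are reported as F1 scores, the sketch below shows one way such a score could be computed for key presses. It is an illustration under stated assumptions, not the paper's evaluation code: the function name `key_press_f1`, the representation of each time step as a set of pressed key indices, and the micro-averaging over all (time step, key) pairs are all hypothetical choices made here.

```python
def key_press_f1(target, played):
    """Return (precision, recall, F1) for key presses.

    Assumption: `target` and `played` are equal-length sequences, one entry
    per time step, each entry a set of pressed key indices. Counts are
    pooled over all time steps (micro-averaging); the paper's metric may
    average differently.
    """
    if len(target) != len(played):
        raise ValueError("trajectories must have the same number of time steps")
    tp = fp = fn = 0
    for want, got in zip(target, played):
        want, got = set(want), set(got)
        tp += len(want & got)   # keys that should be pressed and were
        fp += len(got - want)   # keys pressed that should not have been
        fn += len(want - got)   # keys that should be pressed but were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: four time steps, one missed key and one spurious key.
target = [{39}, {39, 43}, {43}, set()]
played = [{39}, {39}, {43, 44}, set()]
precision, recall, f1 = key_press_f1(target, played)
print(precision, recall, f1)
```

In the toy example there are three correct presses, one spurious press, and one missed press, so precision, recall, and F1 all come out to 0.75.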

Paper Structure (21 sections, 3 equations, 8 figures, 10 tables, 2 algorithms)

Introduction
Related Work
Method
Data preparation: From raw data to human and piano state trajectories
Policy learning: generating robot actions from observations
Policy distillation: learning a generalist piano-playing agent
Experimental Results
Evaluation on learning song-specific policies from demonstrations
Evaluation of model design strategies for policy distillation
Evaluations on the impact of the data in the generalization
Limitations
Conclusion
Retargeting: From human hand to robot hand
Implementation of Inverse Kinematics Solver
Detailed MDP Formulation of Song-specific Policy
...and 6 more sections

Figures (8)

Figure 1: The goal of this work is to train a generalist piano-playing agent (PianoMime) from YouTube videos. We collect a set of videos and accompanying MIDI files and train a single agent to play any song, combining reinforcement learning and behavioral cloning.
Figure 2: Proposed distillation policy architecture. Given an $L$-step window of a target song $\tau^t_{\musEighth} : (\musEighth_{t:t+L})$ at time $t$, a latent representation $\tau_{\bm{z}}^t$ is computed by a pre-trained observation encoder. The policy is then decoupled into a high-level fingertip predictor that generates a trajectory of fingertip positions $\tau_{\bm{x}}^t$ and a low-level inverse dynamics model that generates a trajectory of target joint positions $\tau_{\bm{q}}^t$.
Figure 3: Left: Qualitative comparison of hand postures. Middle: The F1 score achieved by three methods for 10 chosen clips; Right: The F1 score achieved by excluding different elements in .
Figure 4: Qualitative comparison of hand poses. Top: Youtube video, Middle: solution given the video. Bottom: After residual .
Figure 5: Precision and recall for three different policy architectures trained with varying amounts of data and evaluated on the test dataset. Top: Models are trained with the same proportion of high-level and low-level datasets. Bottom: Models are trained with different proportions of high-level and low-level datasets. The x-axis represents the percentage of the low-level dataset used, while HL % indicates the percentage of the high-level dataset used.
...and 3 more figures
