FürElise: Capturing and Physically Synthesizing Hand Motions of Piano Performance

Ruocheng Wang, Pei Xu, Haochen Shi, Elizabeth Schumann, C. Karen Liu

TL;DR

This paper constructs a first-of-its-kind large-scale dataset that contains approximately 10 hours of 3D hand motion and audio from 15 elite-level pianists playing 153 pieces of classical music and develops a pipeline that can synthesize physically-plausible hand motions for musical scores outside of the dataset.

Abstract

Piano playing requires agile, precise, and coordinated hand control that stretches the limits of dexterity. Hand motion models with the sophistication to accurately recreate piano playing have a wide range of applications in character animation, embodied AI, biomechanics, and VR/AR. In this paper, we construct a first-of-its-kind large-scale dataset that contains approximately 10 hours of 3D hand motion and audio from 15 elite-level pianists playing 153 pieces of classical music. To capture natural performances, we designed a markerless setup in which motions are reconstructed from multi-view videos using state-of-the-art pose estimation models. The motion data is further refined via inverse kinematics using the high-resolution MIDI key-pressing data obtained from sensors in a specialized Yamaha Disklavier piano. Leveraging the collected dataset, we developed a pipeline that can synthesize physically-plausible hand motions for musical scores outside of the dataset. Our approach employs a combination of imitation learning and reinforcement learning to obtain policies for physics-based bimanual control involving the interaction between hands and piano keys. To solve the sampling efficiency problem with the large motion dataset, we use a diffusion model to generate natural reference motions, which provide high-level trajectory and fingering (finger order and placement) information. However, the generated reference motion alone does not provide sufficient accuracy for piano performance modeling. We then further augmented the data by using musical similarity to retrieve similar motions from the captured dataset to boost the precision of the RL policy. With the proposed method, our model generates natural, dexterous motions that generalize to music from outside the training dataset.
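The abstract mentions retrieving motions from the captured dataset by musical similarity but does not say how similarity is measured. The following is a minimal sketch of what such a lookup could look like, under assumptions that are not from the paper: each clip is described by its MIDI note onsets, similarity is the Jaccard overlap of time-binned (onset, pitch) pairs, and the names `signature`, `jaccard`, and `retrieve` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Note = Tuple[float, int]  # (onset time in seconds, MIDI pitch)


def signature(notes: List[Note], bin_size: float = 0.05) -> Set[Tuple[int, int]]:
    """Time-shift-invariant set of (onset bin, pitch) pairs.

    ASSUMPTION: this representation is illustrative only; the paper's
    similarity measure is not described in the abstract.
    """
    if not notes:
        return set()
    t0 = min(t for t, _ in notes)  # align every segment to its first onset
    return {(round((t - t0) / bin_size), p) for t, p in notes}


def jaccard(a: Set[Tuple[int, int]], b: Set[Tuple[int, int]]) -> float:
    """Jaccard overlap of two signatures (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def retrieve(query: List[Note], library: Dict[str, List[Note]], k: int = 2) -> List[str]:
    """Return ids of the k library clips most similar to the query segment."""
    q = signature(query)
    scored = [(jaccard(q, signature(notes)), clip_id)
              for clip_id, notes in library.items()]
    # Highest similarity first; ties are broken by clip id for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [clip_id for _, clip_id in scored[:k]]


# Hypothetical library: clip id -> MIDI notes played in that motion clip.
library = {
    "c_major_scale": [(0.0, 60), (0.25, 62), (0.5, 64), (0.75, 65)],
    "c_major_arpeggio": [(0.0, 60), (0.25, 64), (0.5, 67), (0.75, 72)],
    "octaves": [(0.0, 48), (0.0, 60), (0.5, 50), (0.5, 62)],
}
query = [(3.0, 60), (3.25, 62), (3.5, 64), (3.75, 67)]
best = retrieve(query, library, k=2)
```

In a setup like this, the retrieved clip ids would index into the captured motion data, so the motions for musically similar passages could be added to the reference set used during policy training.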

Paper Structure (42 sections, 14 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Our paper (a) collects the first large-scale 3D hand motion dataset of piano playing, accompanied by synchronized audio and key-pressing events; (b) proposes a method that can control a physically simulated hand to play novel pieces that do not appear in the training set.
  • Figure 2: Overview of our pipeline to reconstruct motion data from multi-view videos. We (a) shoot 4K videos from 5 different views at 59.94 FPS using RGB cameras; (b) detect 2D keypoints of the hands from each view; (c) triangulate the 2D keypoints into 3D hand skeletons with calibrated camera intrinsics and extrinsics; (d) fit the skeletons onto MANO hand meshes; and (e) run IK with ground-truth MIDI as end-effector goals to refine the finger placements for correct key pressing.
  • Figure 3: Data capture setup. Five GoPro cameras are placed around the piano to provide multi-view recordings of elite pianists' performances.
  • Figure 4: Examples of piano skills in our dataset, including scales, octaves, and arpeggios. The trajectory of each fingertip is visualized. The green keys are the keys pressed along the trajectory.
  • Figure 5: Overview of our method to physically simulate piano performance from given sheet music. We use MIDI to retrieve motion data from the collected motion dataset and as input to a diffusion model for generating piano performance motions. These two sets of motions are combined into a reference motion ensemble. Utilizing the reference motions, we then employ two discriminator ensembles and three critics, which consider imitation and goal rewards, respectively, to train a control policy via reinforcement learning.
  • ...and 6 more figures
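
Step (c) of Figure 2, lifting 2D keypoints from several calibrated views into one 3D point, can be illustrated with a small sketch. It is not the authors' implementation: it uses a generic least-squares ray intersection, and it assumes each camera's intrinsics and extrinsics have already been used to turn a detected 2D keypoint into a world-space ray (camera center plus viewing direction). The function names and the camera positions are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _solve3(a: List[List[float]], b: List[float]) -> Vec3:
    """Solve a 3x3 linear system with Gauss-Jordan elimination and pivoting."""
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("rays are (nearly) parallel; point is not determined")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return (m[0][3] / m[0][0], m[1][3] / m[1][1], m[2][3] / m[2][2])


def triangulate_rays(origins: Sequence[Vec3], directions: Sequence[Vec3]) -> Vec3:
    """Return the 3D point with minimal summed squared distance to all rays.

    Each ray is treated as an infinite line through origins[i] along
    directions[i]. With unit direction u, the projector P = I - u u^T
    removes the component along the ray, so the minimizer p satisfies
    (sum_i P_i) p = sum_i P_i o_i.
    """
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        norm = sum(x * x for x in d) ** 0.5
        u = [x / norm for x in d]
        for i in range(3):
            for j in range(3):
                p_ij = (1.0 if i == j else 0.0) - u[i] * u[j]
                a[i][j] += p_ij
                b[i] += p_ij * o[j]
    return _solve3(a, b)


# Hypothetical setup: five camera centers observing one fingertip keypoint.
fingertip = (0.1, 0.2, 0.3)
camera_centers = [(1.0, 0.0, 1.0), (-1.0, 0.0, 1.0), (0.0, 1.0, 1.2),
                  (0.5, -1.0, 0.8), (0.0, 0.0, 2.0)]
# Noise-free viewing rays, used here only to check the solver.
rays = [tuple(t - c for t, c in zip(fingertip, center)) for center in camera_centers]
estimate = triangulate_rays(camera_centers, rays)
```

With real detections the rays do not meet exactly, and the least-squares point acts as a compromise across views. Applying the same solve to every hand keypoint in every frame would give a 3D skeleton sequence of the kind that steps (d) and (e) then refine.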