PianoMotion10M: Dataset and Benchmark for Hand Motion Generation in Piano Performance

Qijun Gan, Song Wang, Shengtao Wu, Jianke Zhu

TL;DR

This work addresses the challenge of generating realistic hand motions and fingering from piano music by introducing PianoMotion10M, a large-scale dataset with 116 hours of performances and 10 million annotated hand poses, linked to MIDI data. It proposes a two-stage baseline that first predicts hand positions from audio and then uses a diffusion-based gesture generator conditioned on those positions to produce continuous hand motions, evaluated with metrics like FID, FGD, WGD, PD, and Smoothness. The dataset and an accompanying benchmark enable research on audio-to-motion and fingering analysis for piano, potentially advancing AI-assisted piano instruction and performance simulation. By open-sourcing the dataset and code, the work aims to catalyze developments in hand-motion generation, piano fingering, and multimodal music understanding.
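
The two-stage data flow described above can be sketched roughly as follows. This is a shape-level illustration only, not the authors' implementation: the function names, the feature and pose dimensions, the linear stage-1 map, the choice to concatenate audio features with positions as conditioning, and the toy denoising rule are all assumptions made for the example.

```python
import numpy as np

# Hypothetical sizes chosen for illustration; none of these come from the paper.
N_HANDS, POS_DIM, POSE_DIM = 2, 3, 48


def predict_positions(audio_feats, weights):
    """Stage 1 placeholder: map per-frame audio features (T, D) to a
    3D position for each hand, returning an array of shape (T, 2, 3)."""
    n_frames = audio_feats.shape[0]
    return (audio_feats @ weights).reshape(n_frames, N_HANDS, POS_DIM)


def denoise_step(x, t, cond):
    """Stand-in for a learned denoiser. A trained network would use the
    step index `t` and the conditioning `cond`; this toy ignores both and
    simply halves the current sample so the loop below is runnable."""
    return 0.5 * x


def generate_gestures(audio_feats, positions, rng, steps=10):
    """Stage 2 placeholder: start from Gaussian noise and refine it over
    several denoising steps, passing the stage-1 positions as conditioning."""
    n_frames = audio_feats.shape[0]
    # Assumed conditioning: audio features concatenated with flattened positions.
    cond = np.concatenate([audio_feats, positions.reshape(n_frames, -1)], axis=1)
    x = rng.standard_normal((n_frames, N_HANDS, POSE_DIM))
    for t in reversed(range(steps)):
        x = denoise_step(x, t, cond)
    return x


rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((30, 16))  # 30 frames of 16-dim features
weights = 0.01 * rng.standard_normal((16, N_HANDS * POS_DIM))  # untrained

positions = predict_positions(audio_feats, weights)
gestures = generate_gestures(audio_feats, positions, rng)
print(positions.shape, gestures.shape)
```

The point of the sketch is the ordering: positions are produced first and then passed to the gesture generator as conditioning, so the second stage only has to model hand articulation given where each hand is.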

Abstract

Artificial intelligence techniques for education have recently received increasing attention, yet designing effective musical instrument instruction systems remains an open problem. Although key presses can be derived directly from sheet music, the transitional movements between key presses require more extensive guidance in piano performance. In this work, we construct a piano-hand motion generation benchmark to guide hand movements and fingerings for piano playing. To this end, we collect an annotated dataset, PianoMotion10M, consisting of 116 hours of piano playing videos from a bird's-eye view with 10 million annotated hand poses. We also introduce a powerful baseline model that generates hand motions from piano audio through a position predictor and a position-guided gesture generator. Furthermore, a series of evaluation metrics is designed to assess the performance of the baseline model, including motion similarity, smoothness, positional accuracy of the left and right hands, and overall fidelity of the movement distribution. Although piano key presses corresponding to music scores or audio are already accessible, PianoMotion10M aims to provide guidance on piano fingering for instructional purposes. The source code and dataset can be accessed at https://github.com/agnJason/PianoMotion10M.
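
As an illustration of what a smoothness measure over generated hand motion might look like, the sketch below computes the mean acceleration magnitude of a joint trajectory using second finite differences. The abstract does not define the paper's smoothness metric, so this formula, the function name `mean_acceleration`, and the array layout `(frames, joints, 3)` are assumptions for the example only.

```python
import numpy as np


def mean_acceleration(joints, fps):
    """Mean acceleration magnitude of a trajectory of shape (T, J, 3).

    The second finite difference along the time axis, scaled by fps**2,
    approximates acceleration; lower values indicate smoother motion."""
    acc = np.diff(joints, n=2, axis=0) * fps ** 2
    return float(np.linalg.norm(acc, axis=-1).mean())


fps = 30
t = np.arange(60) / fps

# Constant-velocity motion for 21 hypothetical joints: acceleration is ~0.
steady = np.broadcast_to(t[:, None, None], (60, 21, 3)).copy()

# The same motion with per-frame jitter added: acceleration grows.
rng = np.random.default_rng(0)
jittery = steady + 0.01 * rng.standard_normal(steady.shape)

smooth_score = mean_acceleration(steady, fps)
jitter_score = mean_acceleration(jittery, fps)
print(smooth_score < jitter_score)
```

A measure of this kind penalizes frame-to-frame jitter but says nothing about whether the hands are in the right place, which is why it would be reported alongside positional and distributional metrics rather than on its own.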

Paper Structure

This paper contains 21 sections, 6 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Overview of our framework. We collect videos of professional piano performances from the Internet and process them to construct a large-scale dataset, PianoMotion10M, which comprises piano music, MIDI files and hand motions. Building upon this dataset, we establish a benchmark for generating hand motions from piano music.
  • Figure 2: Illustration of a sample from PianoMotion10M. Each sample in our dataset includes audio, hand pose annotations, and a MIDI file along with the corresponding Bilibili video ID.
  • Figure 3: Illustration of our baseline model. Given a piece of piano music, our baseline model estimates the hand motions by predicting hand positions and generating hand gestures.
  • Figure 4: Illustration of the qualitative results. We display the generated gestures across frames using different methods. Our method stands out due to its greater fidelity, as shown in the examples.
  • Figure A1: Distribution of Note Clicks and Volume Levels in the PianoMotion10M Dataset. The top figure depicts note click frequency, and the bottom one shows the volume distribution.
  • ...and 2 more figures