Can Audio Reveal Music Performance Difficulty? Insights from the Piano Syllabus Dataset

Pedro Ramoneda, Minhee Lee, Dasaem Jeong, J. J. Valero-Mas, Xavier Serra

TL;DR

This work introduces the first audio-based approach to estimating piano performance difficulty by releasing the Piano Syllabus (PSyllabus) dataset (7,901 pieces, 11 difficulty levels, 1,233 composers) and two benchmark collections. It adopts a CRNN with attention operating on unimodal inputs (CQT and piano-roll) and multimodal fusion strategies to predict difficulty as an ordinal label, with evaluations on accuracy, MSE, and rank correlations. The study demonstrates that piano-roll representations outperform CQT, that early multimodal fusion yields the best results, and that era-based multi-task learning offers gains while other auxiliary tasks show mixed impact. The work also investigates generalization through zero-shot evaluation on Hidden Voices, and shows that aggregating multiple performances can improve robustness. Overall, PSyllabus provides a robust foundation for audio-based difficulty estimation in MIR and education, with public data and code to spur further research and practical adoption.
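The evaluation measures named above (accuracy, MSE, and rank correlation over ordinal difficulty labels) can be illustrated with a minimal sketch. It assumes the 11 difficulty levels are encoded as integers 0 to 10 and uses Spearman's coefficient as one possible rank correlation; the summary does not state which encoding or coefficient the authors use. The function names `average_ranks`, `pearson`, and `ordinal_metrics` are hypothetical and not taken from the released code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ordinal_metrics(true_levels, predicted_levels):
    """Accuracy, MSE, and Spearman correlation for integer difficulty levels.

    Assumption: levels are integers (e.g. 0..10 for 11 difficulty levels).
    """
    pairs = list(zip(true_levels, predicted_levels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    mse = sum((t - p) ** 2 for t, p in pairs) / len(pairs)
    # Spearman's coefficient is the Pearson correlation of the rank vectors.
    spearman = pearson(average_ranks(true_levels), average_ranks(predicted_levels))
    return {"accuracy": accuracy, "mse": mse, "spearman": spearman}


if __name__ == "__main__":
    # Made-up labels for illustration only; not data from PSyllabus.
    print(ordinal_metrics([0, 3, 5, 8, 10], [1, 3, 4, 9, 10]))
```

Reporting all three together is useful for ordinal targets: accuracy only rewards exact matches, MSE penalizes predictions by how many levels they miss, and the rank correlation checks whether pieces are ordered correctly even when the absolute level is off.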

Abstract

Automatically estimating the performance difficulty of a music piece is a key process in music education, as it enables curricula tailored to the individual needs of students. Given its relevance, the Music Information Retrieval (MIR) field includes some proof-of-concept works addressing this task, which mainly focus on high-level music abstractions such as machine-readable scores or sheet music images. In this regard, the potential of directly analyzing audio recordings has been generally neglected, which prevents students from exploring diverse music pieces that may not have a formal symbolic-level transcription. This work pioneers the automatic estimation of performance difficulty from audio recordings with two precise contributions: (i) the first audio-based difficulty estimation dataset -- namely, the Piano Syllabus (PSyllabus) dataset -- featuring 7,901 piano pieces across 11 difficulty levels from 1,233 composers; and (ii) a recognition framework capable of managing different input representations -- in both unimodal and multimodal configurations -- directly derived from audio to perform the difficulty estimation task. Comprehensive experimentation comprising different pre-training schemes, input modalities, and multi-task scenarios proves the validity of the proposal and establishes PSyllabus as a reference dataset for audio-based difficulty estimation in the MIR field. The dataset, as well as the developed code and trained models, is publicly shared to promote further research in the field.

Paper Structure (24 sections, 1 equation, 8 figures, 7 tables)

Figures (8)

  • Figure 1: We introduce a recognition framework that utilizes both piano-roll and Constant-Q Transform audio-derived representations to estimate the difficulty of a given piece. The model is trained on the novel PSyllabus dataset using distinct configurations (unimodal and multimodal) and explores multiple training strategies, including multi-tasking with auxiliary tasks, offering valuable insights into the task.
  • Figure 2: Prompt engineering template for ChatGPT (version 4) used to validate the consistency of the PSyllabus dataset.
  • Figure 3: Possible scenarios in the metadata of two given music pieces before the prompt engineering filtering stage.
  • Figure 4: Era distribution of PSyllabus dataset.
  • Figure 5: Composer distribution of PSyllabus dataset.
  • ...and 3 more figures