PBSCR: The Piano Bootleg Score Composer Recognition Dataset

Arhan Jain, Alec Bunn, Austin Pham, TJ Tsai

TL;DR

PBSCR introduces a large-scale, accessible dataset for piano composer recognition by harvesting IMSLP sheet music images and encoding notehead locations as 62×L binary bootleg scores. The dataset includes 40,000 labeled 62×64 samples for a 9-class task and 100,000 labeled 62×64 samples for a 100-class task, plus 29,310 unlabeled variable-length bootleg scores for self-supervised pretraining, with rich IMSLP metadata to support multimodal research. Baseline experiments with CNNs, GPT-2, and RoBERTa show substantial gains from unlabeled pretraining and fine-tuning, but also reveal data leakage challenges and substantial room for improvement, especially in 100-class classification. The authors discuss encoding choices, augmentation strategies, and multimodal integration as fertile directions, highlighting PBSCR’s potential to spur scalable, cross-modal composer-recognition research. The dataset is released with code and links to related datasets to enable community benchmarking and reproducible advances.

Abstract

This article motivates, describes, and presents the PBSCR dataset for studying composer recognition of classical piano music. Our goal was to design a dataset that facilitates large-scale research on composer recognition and is suitable for modern architectures and training practices. To achieve this goal, we utilize the abundance of sheet music images and rich metadata on IMSLP, use a previously proposed feature representation called a bootleg score to encode the location of noteheads relative to staff lines, and present the data in an extremely simple format (2D binary images) to encourage rapid exploration and iteration. The dataset itself contains 40,000 62×64 bootleg score images for a 9-class recognition task, 100,000 62×64 bootleg score images for a 100-class recognition task, and 29,310 unlabeled variable-length bootleg score images for pretraining. The labeled data is presented in a form that mirrors MNIST images, making it easy to visualize, manipulate, and train models efficiently. We include relevant information to connect each bootleg score image with its underlying raw sheet music image, and we scrape, organize, and compile metadata from IMSLP on all piano works to facilitate multimodal research and allow for convenient linking to other datasets. We release baseline results in supervised and low-shot settings for future work to compare against, and we discuss open research questions that the PBSCR data is especially well suited to address.
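
To make the data format concrete, the following is a minimal sketch of how a variable-length 62×L binary bootleg score could be cut into fixed-size 62×64 fragments like the labeled samples. The function name `extract_fragments`, the non-overlapping windowing, and the randomly generated stand-in score are all assumptions for illustration; the sampling procedure actually used to build PBSCR is not described in this section and may differ.

```python
import numpy as np


def extract_fragments(bootleg, width=64, hop=64):
    """Slice a 62 x L binary bootleg score into 62 x `width` fragments.

    Hypothetical helper: uses a simple sliding window with step `hop`
    and drops any trailing columns that do not fill a full window.
    """
    height, length = bootleg.shape
    if height != 62:
        raise ValueError(f"expected 62 rows, got {height}")
    starts = range(0, length - width + 1, hop)
    frags = [bootleg[:, s:s + width] for s in starts]
    if not frags:
        # Score is shorter than one window: return an empty batch.
        return np.zeros((0, height, width), dtype=np.uint8)
    return np.stack(frags).astype(np.uint8)


# Stand-in data: a sparse random binary matrix, NOT a real bootleg score.
rng = np.random.default_rng(0)
score = (rng.random((62, 200)) < 0.05).astype(np.uint8)

fragments = extract_fragments(score)
print(fragments.shape)  # (3, 62, 64)
```

The resulting array has the shape (number of fragments, 62, 64), so each fragment can be treated like a small single-channel binary image, in the same way MNIST digits are handled.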

Paper Structure

This paper contains 22 sections, 1 equation, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Two examples of a piano sheet music excerpt (left) and corresponding bootleg score representation (right). Staff lines are not encoded in the bootleg score representation itself, but they are overlaid in the examples above as a visual reference.
  • Figure 2: Examples of non-music filler pages and their extracted (gibberish) bootleg scores.
  • Figure 3: Histogram of the number of bootleg score events in a set of manually labeled music pages (top) and non-music pages (bottom).
  • Figure 4: Predicted probability of an ensembled classifier that classifies validation pages as filler (non-music) vs non-filler. We use a hard threshold of 0.5 to ensure that filler pages are excluded from our dataset with high confidence.
  • Figure 5: (Top) The total number of pieces/works available on IMSLP for the composers in the 100-class dataset. (Bottom) The total number of bootleg score events for each composer in the 100-class dataset. The list of composers sorted by number of works can be found at https://github.com/HMC-MIR/PBSCR/blob/main/forPaper/composers_sorted_numpieces.txt.
  • ...and 1 more figure