MusicScore: A Dataset for Music Score Modeling and Generation

Yuheng Lin, Zheqi Dai, Qiuqiang Kong

TL;DR

MusicScore addresses the lack of large-scale music score benchmarks by constructing image-text pairs of score pages sourced from IMSLP and curated into small, medium, and large subsets. Each high-quality score image is linked to rich metadata stored as JSON, which enables text-driven score generation experiments. A diffusion-based system (VAE–OpenCLIP–UNet) is developed to generate visually readable score images conditioned on text descriptions, and it is evaluated with FID scores across the subsets. The dataset and tooling are released publicly to spur advances in music score modeling, with future work aiming at cross-modal integration and a MusicScore-CLIP model.

Abstract

Music scores are written representations of music and contain rich information about musical components. The visual information on music scores includes notes, rests, staff lines, clefs, dynamics, and articulations. This visual information in music scores contains more semantic information than audio and symbolic representations of music. Previous music score datasets have limited sizes and are mainly designed for optical music recognition (OMR). There is a lack of research on creating a large-scale benchmark dataset for music modeling and generation. In this work, we propose MusicScore, a large-scale music score dataset collected and processed from the International Music Score Library Project (IMSLP). MusicScore consists of image-text pairs, where the image is a page of a music score and the text is the metadata of the music. The metadata of MusicScore is extracted from the general information section of the IMSLP pages. The metadata includes rich information about the composer, instrument, piece style, and genre of the music pieces. MusicScore is curated into small, medium, and large scales of 400, 14k, and 200k image-text pairs with varying diversity, respectively. We build a score generation system based on a UNet diffusion model to generate visually readable music scores conditioned on text descriptions to benchmark the MusicScore dataset for music score generation. MusicScore is released to the public at https://huggingface.co/datasets/ZheqiDAI/MusicScore.
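The image-text pairing described in the abstract can be illustrated with a minimal sketch. The field names (`composer`, `instrument`, `piece_style`, `genre`), the `ScorePair` class, the `metadata_to_prompt` helper, and the file path are hypothetical and chosen only to mirror the metadata categories listed above; they are not the dataset's actual schema or the authors' code.

```python
from dataclasses import dataclass


@dataclass
class ScorePair:
    """One image-text pair: a score page plus its metadata (hypothetical layout)."""
    image_path: str  # path to a single score-page image (assumed)
    metadata: dict   # metadata fields for the piece (assumed keys)


def metadata_to_prompt(metadata: dict) -> str:
    """Join the non-empty metadata fields into one text description."""
    # Hypothetical keys mirroring the categories named in the abstract.
    fields = ["composer", "instrument", "piece_style", "genre"]
    parts = []
    for key in fields:
        value = str(metadata.get(key, "")).strip()
        if value:  # skip missing or empty fields
            parts.append(f"{key.replace('_', ' ')}: {value}")
    return ", ".join(parts)


pair = ScorePair(
    image_path="scores/example_page_001.png",  # placeholder path
    metadata={"composer": "Example Composer", "instrument": "violin", "genre": ""},
)
prompt = metadata_to_prompt(pair.metadata)
print(prompt)
```

A string of this kind could serve as the text condition for a text-to-image model; how the released system actually formats its prompts is not stated in this section.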

Paper Structure (32 sections, 9 figures, 2 tables)


Figures (9)

  • Figure 1: An example of an A major violin score demonstrating the fundamental elements of staff notation, including clefs, accidentals, dynamics, and other performance-technique notation for bowed string instruments.
  • Figure 2: The MusicScore dataset collection and processing pipeline.
  • Figure 3: First column: low-quality score with yellowed pages. Second column: low-quality score with too much white space and noisy points. Third and fourth columns: high-quality scores.
  • Figure 4: Left: metadata of "Muisca - El Dorado" by Michael Maxwell Steer. Right: the first page of the score.
  • Figure 5: The data statistics of music scores in the MusicScore-200k dataset.
  • ...and 4 more figures