PIAST: A Multimodal Piano Dataset with Audio, Symbolic and Text

Hayeon Bang, Eunjin Choi, Megan Finch, Seungheon Doh, Seolhee Lee, Gyeong-Hoon Lee, Juhan Nam

TL;DR

This work presents PIAST (PIano dataset with Audio, Symbolic, and Text), a piano music dataset that includes audio, text, tag annotations, and MIDI transcribed with state-of-the-art piano transcription and beat tracking models.

Abstract

While piano music has become a significant area of study in Music Information Retrieval (MIR), there is a notable lack of datasets for piano solo music with text labels. To address this gap, we present PIAST (PIano dataset with Audio, Symbolic, and Text), a piano music dataset. Using a piano-specific taxonomy of semantic tags, we collected 9,673 tracks from YouTube and added annotations by music experts for 2,023 tracks, resulting in two subsets: PIAST-YT and PIAST-AT. Both include audio, text, tag annotations, and MIDI transcribed with state-of-the-art piano transcription and beat tracking models. Among the many tasks this multimodal dataset supports, we conduct music tagging and retrieval using both audio and MIDI data and report baseline performance to demonstrate its potential as a valuable resource for MIR research.

Paper Structure

This paper contains 13 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Top 50 words most frequently appearing in the text dataset of the PIAST-YT.
  • Figure 2: Tag distribution of the PIAST-AT dataset. Three distinct shades represent the degree of consensus (darkest: n=3, medium: n=2, lightest: n=1).
  • Figure 3: Annotation interface used in the PIAST-AT dataset.
  • Figure 4: Co-occurrence between tags in the PIAST-AT dataset.
  • Figure 5: ROC-AUC and PR-AUC scores for each tag in tag-to-music retrieval performance. The darker bars represent audio performance, while the lighter bars represent MIDI performance.
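
The per-tag ROC-AUC and PR-AUC scores named in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the tag names (`tag_a`, `tag_b`), and the toy labels and scores are hypothetical, and PR-AUC is approximated here by average precision, which is one common estimator among several.

```python
from typing import Dict, List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative track")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of precision@k over the ranks k of relevant tracks.

    Used here as an estimate of PR-AUC. With tied scores the result depends
    on the sort order, so the toy data below avoids ties.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive track")
    return total / hits


def per_tag_metrics(
    data: Dict[str, Tuple[List[int], List[float]]]
) -> Dict[str, Dict[str, float]]:
    """Compute both metrics independently for every tag."""
    return {
        tag: {
            "roc_auc": roc_auc(labels, scores),
            "pr_auc": average_precision(labels, scores),
        }
        for tag, (labels, scores) in data.items()
    }


if __name__ == "__main__":
    # Hypothetical data: for each tag, ground-truth labels (1 = track has the
    # tag) and retrieval similarity scores for four tracks.
    toy = {
        "tag_a": ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
        "tag_b": ([0, 1, 1, 0], [0.2, 0.6, 0.9, 0.4]),
    }
    for tag, metrics in per_tag_metrics(toy).items():
        print(tag, metrics)
```

Running the same computation once on audio-based scores and once on MIDI-based scores would give the kind of paired per-tag values that Figure 5 compares.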