Detecting Music Performance Errors with Transformers

Benjamin Shiue-Hal Chou, Purvish Jajal, Nicholas John Eliopoulos, Tim Nadolsky, Cheng-Yun Yang, Nikita Ravi, James C. Davis, Kristen Yeon-Ji Yun, Yung-Hsiang Lu

TL;DR

This work tackles the challenge of giving beginner musicians fine-grained feedback on performance errors, where prior tools rely on brittle alignment and cover few error types. It introduces Polytune, an end-to-end transformer that ingests audio from both a musical score and a performance and outputs annotated, MIDI-like tokens without explicit alignment; it is trained on the large-scale synthetic datasets MAESTRO-E and CocoChorales-E. Polytune achieves state-of-the-art performance with an average Error Detection F1 of $64.1\%$ across 14 instruments, improving on prior work by 40 percentage points and enabling multi-instrument error detection. The approach demonstrates the value of end-to-end latent alignment and synthetic data for scalable, fine-grained feedback in music education, and the source code and datasets are openly available for further research.

Abstract

Beginner musicians often struggle to identify specific errors in their performances, such as playing incorrect notes or rhythms. Existing tools for music error detection have two limitations: (1) they rely on automatic alignment and are therefore prone to errors caused by small deviations between alignment targets; (2) there is not enough data to train music error detection models, which leads to over-reliance on heuristics. To address (1), we propose a novel transformer model, Polytune, that takes audio inputs and outputs annotated music scores. This model can be trained end-to-end to implicitly align and compare performance audio with music scores through latent space representations. To address (2), we present a novel data generation technique capable of creating large-scale synthetic music error datasets. Our approach achieves a 64.1% average Error Detection F1 score, improving upon prior work by 40 percentage points across 14 instruments. Additionally, unlike existing transcription methods repurposed for music error detection, our model can handle multiple instruments. Our source code and datasets are available at https://github.com/ben2002chou/Polytune.

Paper Structure (21 sections, 1 equation, 5 figures, 4 tables, 1 algorithm)

Figures (5)

  • Figure 1: The score on top is the performance transcription and the score below is the reference. Our method detects three types of errors. (1) is an extra note: “C” is played, but the score does not expect it. (2) is a missed note: in this figure, “C” is not played. (3) is a wrong note, which is equivalent to a missed note and an extra note occurring at the same time: the score expects a “C”, but a “B” is played instead. The model takes the score and the music student's recorded audio as input, and labels the notes detected from the recording as “Missed”, “Extra”, or “Correct”.
  • Figure 2: Illustration of differences between (a) previous work and (b) our Polytune. Our approach simplifies music error detection by using an end-to-end trainable architecture. This eliminates the need to explicitly align and compare audio, which is error-prone.
  • Figure 3: Deficiency of Dynamic Time Warping (DTW): DTW encounters challenges when aligning complex sequences of overlapping notes in music. Specifically, when DTW aligns one note within a group, it often compromises the alignment of other notes. This issue arises because DTW attempts to minimize timing differences across the entire sequence. Once the MIDI score is aligned with the audio, the algorithm may align one note correctly but misalign others, leading to potential classification errors. For example, as seen in the overlay of score and audio, aligning an "A" note might result in the misalignment of the adjacent "C#" note, causing the algorithm to mistakenly classify a correctly played note as a missed "C#" and then an extra "C#".
  • Figure 4: Architecture of Polytune. The diagram illustrates the process flow starting with the Score and Performance Audio inputs, each processed through dedicated AST encoders. These encoded features are concatenated and passed through a joint encoder and a decoder with cross-attention for temporal sequencing. The output is generated through greedy autoregressive sampling, providing MIDI-like tokens that classify notes as correct or missing.
  • Figure 5: Qualitative Comparison of MIDI Note Events: This figure shows the "correct category" note events detected by each model for a track in the CocoChorales-E dataset. Ground truth notes are filled blue rectangles, and model predictions are black-outlined rectangles. The false detection at Music note 1 is caused by an MT3 transcription error, and a similar error occurs at Music note 2. Overall, Polytune matches the ground truth more closely and has fewer false detections than our re-implementation of Benetos and Wang.
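
The label semantics described in Figure 1 can be illustrated with a small sketch. This is not Polytune's method: Polytune predicts labels end-to-end from audio, whereas the code below assumes the notes have already been transcribed and time-aligned, which is exactly the step the paper argues is error-prone. The `Note` representation, the `label_notes` helper, and the 50 ms onset tolerance are hypothetical choices made only for illustration.

```python
from typing import List, Tuple

# Hypothetical note representation: (MIDI pitch, onset time in seconds).
Note = Tuple[int, float]


def label_notes(
    score: List[Note], performance: List[Note], tol: float = 0.05
) -> List[Tuple[Note, str]]:
    """Label notes as Correct / Extra / Missed by greedy matching.

    Assumes both note lists are already on the same time axis.
    """
    unmatched = list(score)  # score notes not yet matched to a played note
    labels = []
    for pitch, onset in performance:
        # Find a score note with the same pitch and a close enough onset.
        match = next(
            (s for s in unmatched if s[0] == pitch and abs(s[1] - onset) <= tol),
            None,
        )
        if match is not None:
            unmatched.remove(match)
            labels.append(((pitch, onset), "Correct"))
        else:
            labels.append(((pitch, onset), "Extra"))
    # Score notes that were never played are reported as Missed.
    labels.extend((s, "Missed") for s in unmatched)
    return labels


score = [(60, 0.0), (62, 0.5), (60, 1.0)]        # C4, D4, C4
performance = [(60, 0.0), (62, 0.5), (59, 1.0)]  # C4, D4, B3 (wrong note)
for note, label in label_notes(score, performance):
    print(note, label)
```

In this example the final B3 is labeled "Extra" and the expected C4 is labeled "Missed", which matches the caption's description of a wrong note as a missed note and an extra note occurring together.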