
TheGlueNote: Learned Representations for Robust and Flexible Note Alignment

Silvan David Peter, Gerhard Widmer

TL;DR

This work tackles robust symbolic note alignment between two versions of a MIDI piece, focusing on large mismatches such as repeats, skips, and ornamentations. It introduces TheGlueNote, a transformer-based encoder that learns note-wise representations from 512-note windows and outputs a 513-note similarity matrix used to identify matches. Matches are then extracted in one of three ways: direct maximal-similarity matching, a decoder head, or DTW-based post-processing. Trained on synthetically augmented MIDI data, TheGlueNote performs on par with the state of the art and is robust to mismatches while operating directly on plain MIDI, without requiring quantization or score annotations. Ablation studies highlight the effectiveness of DTW-based post-processing applied to the learned representations, which also offers favorable runtime. The approach opens avenues for end-to-end and cross-domain extensions.

Abstract

Note alignment refers to the task of matching individual notes of two versions of the same symbolically encoded piece. Methods addressing this task commonly rely on sequence alignment algorithms such as Hidden Markov Models or Dynamic Time Warping (DTW) applied directly to note or onset sequences. While successful in many cases, such methods struggle with large mismatches between the versions. In this work, we learn note-wise representations from data augmented with various complex mismatch cases, e.g., repeats, skips, block insertions, and long trills. At the heart of our approach lies a transformer encoder network, TheGlueNote, which predicts pairwise note similarities for two 512-note subsequences. We postprocess the predicted similarities using flavors of weightedDTW and pitch-separated onsetDTW to retrieve note matches for two sequences of arbitrary length. Our approach performs on par with the state of the art in terms of note alignment accuracy, is considerably more robust to version mismatches, and works directly on any pair of MIDI files.
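
To make the post-processing idea concrete, here is a minimal sketch of plain DTW run on a pairwise similarity matrix. It is not the weightedDTW or pitch-separated onsetDTW variant named in the abstract, whose details are not given here. The function name `dtw_path` and the cost definition `1 - similarity` are assumptions chosen for illustration.

```python
from typing import List, Tuple


def dtw_path(sim: List[List[float]]) -> List[Tuple[int, int]]:
    """Plain DTW over a similarity matrix (rows: sequence 1, columns: sequence 2).

    Assumption: similarities lie roughly in [0, 1], so 1 - sim serves as a cost.
    Returns a monotonic path of (row, column) index pairs.
    """
    n, m = len(sim), len(sim[0])
    inf = float("inf")
    # acc[i][j] holds the minimal accumulated cost of a path ending at cell (i-1, j-1).
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 1.0 - sim[i - 1][j - 1]
            acc[i][j] = cost + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])

    # Backtrack from the last cell; ties prefer the diagonal step.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (acc[i - 1][j - 1], i - 1, j - 1),  # diagonal
            (acc[i - 1][j], i - 1, j),          # up
            (acc[i][j - 1], i, j - 1),          # left
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return path


# Toy example: sequence 2 contains one extra note (column 1).
toy_sim = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
toy_path = dtw_path(toy_sim)
```

In the toy example the path still passes through the three high-similarity cells, and the extra note is absorbed by a horizontal step. A full pipeline would also need a rule for deciding which path cells count as actual note matches, which this sketch leaves out.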

Paper Structure (14 sections, 2 figures, 4 tables)


Figures (2)

  • Figure 1: Overview of the proposed model. During training (top row), the data flows from Data Processing (black, top left) through TheGlueNote (blue, middle left) into the Decoder Head (blue, middle right) and the aggregated Loss (yellow, top right). Concretely, a MIDI file is loaded into the Data Processing module, which outputs matching targets to the loss module and the concatenated sequences $s_1$ and $s_2$ to TheGlueNote. TheGlueNote (middle row) consists of a transformer encoder with learned positional embeddings (LPE) and repeated attention blocks (center module with multi-head attention, MHA, and a two-layer feedforward network, 2L FF). The note-wise representations are split and multiplied to form a pairwise similarity matrix, with $s_1$ along the rows and $s_2$ along the columns (shown in the loss module). Two cross-entropy loss terms are computed from this matrix; the matrix is also forwarded to the decoder head, whose classifier output adds a third loss term. During inference (bottom row), two MIDI files to be matched are passed directly to TheGlueNote. The resulting similarity matrix can be processed in three ways: 1) direct maximal-similarity match extraction (Matrix Match box), 2) using the decoder head's output, or 3) using a DTW-based match extraction (red, bottom right).
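
The split-and-multiply step and the direct match extraction described in the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the embeddings are toy values, the helper names `similarity_matrix` and `mutual_matches` are hypothetical, and the rule of keeping only mutual row/column maxima is one plausible reading of "maximal similarity match extraction".

```python
from typing import List, Tuple


def similarity_matrix(emb: List[List[float]], n1: int) -> List[List[float]]:
    """Split concatenated note embeddings after n1 rows and take all dot products.

    Rows of the result index notes of s_1, columns index notes of s_2.
    """
    e1, e2 = emb[:n1], emb[n1:]
    return [[sum(a * b for a, b in zip(u, v)) for v in e2] for u in e1]


def mutual_matches(sim: List[List[float]]) -> List[Tuple[int, int]]:
    """Keep (i, j) only if j is the best column for row i AND i is the best row for column j.

    Assumption: mutual maxima stand in for the "Matrix Match" step.
    """
    n, m = len(sim), len(sim[0])
    row_best = [max(range(m), key=lambda j: sim[i][j]) for i in range(n)]
    col_best = [max(range(n), key=lambda i: sim[i][j]) for j in range(m)]
    return [(i, j) for i, j in enumerate(row_best) if col_best[j] == i]


# Toy embeddings: two notes of s_1 followed by two notes of s_2 (2-dimensional).
toy_emb = [
    [1.0, 0.0],  # s_1, note 0
    [0.0, 1.0],  # s_1, note 1
    [0.0, 1.0],  # s_2, note 0
    [1.0, 0.0],  # s_2, note 1
]
toy_sim = similarity_matrix(toy_emb, n1=2)
toy_matches = mutual_matches(toy_sim)  # [(0, 1), (1, 0)]
```

Requiring agreement in both directions discards one-sided maxima, which is a simple way to leave inserted or deleted notes unmatched. Other extraction rules are possible.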