Just Label the Repeats for In-The-Wild Audio-to-Score Alignment

Irmak Bukey, Michael Feffer, Chris Donahue

TL;DR

Proposes an evaluation protocol for audio-to-score alignment that measures the distance between the estimated and ground truth alignments in units of measures. Under this protocol, a jump annotation workflow and improved feature representations together improve alignment accuracy by 150% relative to prior work (33% to 82%).

Abstract

We propose an efficient workflow for high-quality offline alignment of in-the-wild performance audio and corresponding sheet music scans (images). Recent work on audio-to-score alignment extends dynamic time warping (DTW) to be theoretically able to handle jumps in sheet music induced by repeat signs. This method requires no human annotations, but we show that it often yields low-quality alignments. As an alternative, we propose a workflow and interface that allow users to quickly annotate jumps (by clicking on repeat signs), requiring a small amount of human supervision but yielding much higher quality alignments on average. Additionally, we refine audio and score feature representations to improve alignment quality by: (1) integrating measure detection into the score feature representation, and (2) using raw onset prediction probabilities from a music transcription model instead of a piano roll. We propose an evaluation protocol for audio-to-score alignment that computes the distance between the estimated and ground truth alignment in units of measures. Under this evaluation, we find that our proposed jump annotation workflow and improved feature representations together improve alignment accuracy by 150% relative to prior work (33% to 82%).
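As a rough illustration of the idea in the abstract, the sketch below unrolls a score according to labeled jumps and then aligns it to audio with ordinary DTW. Everything specific here is an assumption made for illustration, not the authors' implementation: the function names `unroll_measures` and `dtw_path`, the `(from_measure, to_measure)` jump format with 0-indexed measures, the toy one-vector-per-measure features, and the `1 - dot product` cost.

```python
import numpy as np


def unroll_measures(num_measures, jumps):
    """Expand a score into the order in which its measures are played.

    `jumps` is a hypothetical list of (from_measure, to_measure) pairs.
    Each jump is taken exactly once: after playing `from_measure`,
    playback continues at `to_measure` (e.g. a repeat sign).
    """
    pending = list(jumps)
    order = []
    m = 0
    while m < num_measures:
        order.append(m)
        match = next((j for j in pending if j[0] == m), None)
        if match is not None:
            pending.remove(match)  # a labeled jump is consumed when taken
            m = match[1]
        else:
            m += 1
    return order


def dtw_path(cost):
    """Ordinary DTW over a (score_frames x audio_frames) cost matrix.

    Returns the minimum-cost monotonic path as a list of
    (score_index, audio_index) pairs.
    """
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev

    # Backtrack from the end of both sequences to the start.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
        path.append((i - 1, j - 1))
    return path[::-1]


# Toy example: a 3-measure score whose first two measures are repeated.
order = unroll_measures(3, [(1, 0)])         # played order of measures
measure_feats = np.eye(3)                    # one toy feature vector per measure
score_feats = measure_feats[order]           # unrolled score features
audio_feats = measure_feats[[0, 1, 0, 0, 1, 2]]  # performer lingers on the repeat
cost = 1.0 - score_feats @ audio_feats.T     # 0 where features match, 1 otherwise
path = dtw_path(cost)
aligned_measures = [order[i] for i, _ in path]
```

Because the jumps are resolved before alignment, the DTW step itself stays unmodified; the repeat labels only change which score sequence it sees.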

Paper Structure

This paper contains 21 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: An overview of the task of audio-to-score alignment and our proposed approach. Given a score image (as a PDF) and corresponding performance audio (e.g., an MP3) as input, the task involves outputting an alignment between time in the recording and playheads in the score image. A key challenge in this task is handling jumps in the score, e.g., those created by repeat signs. In lieu of robust automatic methods for detecting or handling jumps, we propose a pragmatic approach of having experts simply label the repeats, which can be done quickly and greatly improves task performance. Our proposed system combines the repeat labels with score feature representations inspired by past work on bootleg scores [yang2020midi]. This score representation is aligned with audio feature representations inspired by [maman2022unaligned] using ordinary DTW.
  • Figure 2: A score playhead (blue line), the output of an audio-to-score alignment, is characterized by its vertical offset ($y$), horizontal offset ($x$), and height ($h$), all relative to the page. A measure-aware alignment is indexed by $m$, a fractional measure, that can be converted to a score playhead by lookup and interpolation in a list of bounding boxes (brown outlines). Our measure-aware evaluation compares estimated playheads $m'$ to ground truth $m^*$.
  • Figure 3: Illustration of our web-based interface for labeling jumps (e.g., repeats) in scores. Our interface enables rapid jump annotation (just seconds per page after training), which we find dramatically improves alignment quality on pieces with jumps.
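The Figure 2 caption describes converting a fractional measure index $m$ into a score playhead by lookup and interpolation in a list of bounding boxes, and comparing estimated measures $m'$ to ground truth $m^*$. A minimal sketch of that idea follows. The `Box` and `Playhead` types, the function names, the linear left-to-right interpolation within a box, and the tolerance-based accuracy are assumptions for illustration; the source text here does not specify these details.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box of one measure, in page coordinates (hypothetical layout)."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


@dataclass
class Playhead:
    """A vertical line on the page: horizontal offset, vertical offset, height."""
    x: float
    y: float
    h: float


def measure_to_playhead(m, boxes):
    """Map a fractional measure index to a playhead.

    The integer part of `m` selects a bounding box (lookup); the
    fractional part moves the playhead linearly from the box's left
    edge to its right edge (interpolation).
    """
    idx = max(0, min(int(math.floor(m)), len(boxes) - 1))
    frac = min(max(m - idx, 0.0), 1.0)
    box = boxes[idx]
    return Playhead(x=box.x + frac * box.w, y=box.y, h=box.h)


def measure_error(m_est, m_true):
    """Distance between estimated and ground-truth positions, in measures."""
    return abs(m_est - m_true)


def accuracy_within(m_ests, m_trues, tol):
    """Fraction of positions whose error is at most `tol` measures.

    `tol` is a hypothetical threshold chosen by the caller.
    """
    hits = sum(measure_error(e, t) <= tol for e, t in zip(m_ests, m_trues))
    return hits / len(m_ests)


# Toy usage: two measures on the same staff line.
boxes = [Box(100.0, 50.0, 200.0, 80.0), Box(300.0, 50.0, 100.0, 80.0)]
playhead = measure_to_playhead(1.5, boxes)  # halfway through the second measure
acc = accuracy_within([0.5, 2.0, 7.0], [0.6, 2.4, 4.0], tol=0.5)
```

Measuring error on $m$ rather than on page coordinates keeps the metric independent of page layout and scan resolution, which is one plausible motivation for a measure-based protocol.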