LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR

Guang Yang, Victoria Ebert, Nazif Tamer, Brian Siyuan Zheng, Luiza Pozzobon, Noah A. Smith

TL;DR

Legato introduces a large-scale, end-to-end OMR model capable of processing multi-page typeset scores and generating ABC notation. It combines a frozen pretrained vision encoder with a transformer-based ABC decoder and a data-efficient BPE tokenizer trained on 238,386 image-ABC pairs from the PDMX-Synth dataset, enabling robust generalization across diverse score layouts. The work defines a canonical ABC representation, develops a dual rendering pipeline for diverse visual styles, and evaluates on multiple datasets, achieving state-of-the-art results on the TEDn and OMR-NED metrics, including on the challenging OpenScore and IMSLP piano scores. The approach demonstrates the practicality of end-to-end OMR for large-scale digitization of musical scores and highlights the potential of NLP-friendly representations to facilitate downstream analysis and rendering.

Abstract

We propose Legato, a new end-to-end model for optical music recognition (OMR), the task of converting music score images to machine-readable documents. Legato is the first large-scale pretrained OMR model capable of recognizing full-page or multi-page typeset music scores and the first to generate documents in ABC notation, a concise, human-readable format for symbolic music. Bringing together a pretrained vision encoder with an ABC decoder trained on a dataset of more than 214K images, our model exhibits a strong ability to generalize across various typeset scores. We conduct comprehensive experiments on a range of datasets and metrics and demonstrate that Legato outperforms the previous state of the art. On our most realistic dataset, we see a 68% and 47.6% absolute error reduction on the standard metrics TEDn and OMR-NED, respectively.

Paper Structure

This paper contains 32 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Model architecture. The input image is first cropped into overlapping segments with an aspect ratio of 1:4 or less, then resized and divided into four patches. The image patches are fed into a vision encoder (parameters are frozen during training). The resulting latent embeddings serve as cross-attention keys and values in a transformer decoder, which autoregressively generates ABC tokens. Special tokens <B>, <I>, and <E> denote <|begin_of_abc|>, <|image|>, and <|end_of_abc|>, respectively. For readability, "_" represents whitespace in the figure.
  • Figure 2: An example of our canonical ABC representation (below) with a MusicXML-rendered image (above).
  • Figure 3: Example vocabulary items from tokenization.
  • Figure 4: ABC error rates on PDMX-Synth test input with different aspect ratios. Error rates are reported by averaging over each bin. Legato is capable of recognizing multi-page scores.
  • Figure 5: Example (first system of Duetto No. 1 in E minor by Bach, BWV 802) from IMSLP Piano Scores (top), with output from Legato (middle) and SMT++ (bottom). Errors are marked in red boxes.
  • ...and 1 more figures
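
The cropping step described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "aspect ratio of 1:4 or less" means width:height, so each segment is at most four times as tall as it is wide, and that segments are stacked vertically. The function name `crop_windows` and the `overlap` fraction are hypothetical, since the listing above does not state how much adjacent segments overlap.

```python
from typing import List, Tuple


def crop_windows(
    height: int,
    width: int,
    max_ratio: float = 4.0,   # assumed: segment height <= max_ratio * width
    overlap: float = 0.25,    # hypothetical overlap fraction between segments
) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges of overlapping vertical segments."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")

    max_h = int(width * max_ratio)
    if height <= max_h:
        # The whole image already satisfies the aspect-ratio limit.
        return [(0, height)]

    # Step down the page so that consecutive segments share some rows.
    stride = max(1, int(max_h * (1.0 - overlap)))
    windows: List[Tuple[int, int]] = []
    top = 0
    while top + max_h < height:
        windows.append((top, top + max_h))
        top += stride
    # The last segment is aligned with the bottom edge of the image.
    windows.append((height - max_h, height))
    return windows


if __name__ == "__main__":
    # A tall, multi-page-like image that is 500 px wide and 5000 px high.
    for top, bottom in crop_windows(5000, 500):
        print(top, bottom)
```

Each returned range could then be passed to an image library's crop routine before the resize-and-patch step; that part is omitted here because the listing does not give the target resolution.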