Sheet Music Transformer: End-To-End Optical Music Recognition Beyond Monophonic Transcription

Antonio Ríos-Vila, Jorge Calvo-Zaragoza, Thierry Paquet

TL;DR

The paper tackles end-to-end optical music recognition for polyphonic scores by introducing the Sheet Music Transformer (SMT), an autoregressive image-to-sequence model with a CNN-based encoder and a Transformer decoder that transcribes sheet music into the Humdrum **kern encoding. SMT preserves multi-voice structures through a two-dimensional positional encoding and a 2D-aware feature representation, avoiding the monophony-oriented vertical collapse of prior methods. Evaluated on the GrandStaff and Quartets datasets, SMT outperforms state-of-the-art approaches, with the SMT_NexT backbone achieving substantial improvements in character, symbol, and line error rates (CER, SER, and LER) and producing transcriptions that render more reliably in musicological tools. This work advances end-to-end OMR toward robust polyphonic transcription and points to future work on segmentation-free full-page transcription and universal OMR, with tangible benefits for music analysis and digitization workflows.
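To illustrate what a two-dimensional positional encoding adds over a flattened, one-dimensional one, here is a minimal sketch in plain Python. It is not the paper's implementation: the function name `positional_encoding_2d`, the sinusoidal form, and the channel layout (first half of the channels for the row, second half for the column) are all assumptions chosen for illustration.

```python
import math


def positional_encoding_2d(height, width, d_model):
    """Build a height x width x d_model nested list of sinusoidal encodings.

    Assumed layout (hypothetical, not taken from the paper): the first half
    of the channels encodes the row index, the second half the column index.
    """
    if d_model % 4 != 0:
        raise ValueError("d_model must be divisible by 4")
    half = d_model // 2

    def encode_1d(pos):
        # Standard sine/cosine pairs at geometrically spaced frequencies.
        vec = []
        for k in range(0, half, 2):
            freq = 1.0 / (10000 ** (k / half))
            vec.append(math.sin(pos * freq))
            vec.append(math.cos(pos * freq))
        return vec

    rows = [encode_1d(y) for y in range(height)]
    cols = [encode_1d(x) for x in range(width)]
    # Concatenate the row code and the column code for every grid cell.
    return [[rows[y] + cols[x] for x in range(width)] for y in range(height)]


pe = positional_encoding_2d(height=2, width=3, d_model=8)
```

With this layout, two cells in the same column share their column half but differ in their row half, so a decoder attending over the feature map can tell vertically stacked staves apart instead of treating them as one collapsed sequence.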

Abstract

State-of-the-art end-to-end Optical Music Recognition (OMR) has, to date, primarily been carried out using monophonic transcription techniques to handle complex score layouts, such as polyphony, often by resorting to simplifications or specific adaptations. Despite their efficacy, these approaches imply challenges related to scalability and limitations. This paper presents the Sheet Music Transformer, the first end-to-end OMR model designed to transcribe complex musical scores without relying solely on monophonic strategies. Our model employs a Transformer-based image-to-sequence framework that predicts score transcriptions in a standard digital music encoding format from input images. Our model has been tested on two polyphonic music datasets and has proven capable of handling these intricate music structures effectively. The experimental outcomes not only indicate the competence of the model, but also show that it is better than the state-of-the-art methods, thus contributing to advancements in end-to-end OMR transcription.

Paper Structure

This paper contains 20 sections, 3 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Graphic scheme of the SMT architecture.
  • Figure 2: Example of an excerpt from Humdrum **kern-encoded pianoform music. The blue dashed line represents a **kern spine, which is a voice in the musical score, in this case, a staff. The green box represents simultaneous notes during interpretation, which are represented as a text line in the ground truth. The red box is a single musical symbol (in this case, a note). The ground truth is read top to bottom and left to right, while the score is read from left to right and bottom to top.
  • Figure 3: Examples of the data contained in the GrandStaff and Camera GrandStaff datasets.
  • Figure 4: Example of a music excerpt from the Quartets dataset.
  • Figure 5: Test example from the GrandStaff dataset with the errors highlighted. This specific sample attained a CER of 5.0%, a SER of 6.0% and a LER of 20.5%.
  • ...and 2 more figures
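
The CER, SER and LER values quoted in the Figure 5 caption can be read as edit distances normalized by the length of the ground truth, computed at three granularities. The sketch below assumes that reading. The helper names, the whitespace tokenization used for symbols, and the two-spine example are illustrative assumptions, not the paper's evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance as a percentage of the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def kern_error_rates(ref, hyp):
    """Character, symbol and line error rates for two **kern strings.

    Assumption: a 'symbol' is any whitespace-delimited token, and a 'line'
    is one text line, i.e. one set of simultaneous events across spines.
    """
    return {
        "CER": error_rate(list(ref), list(hyp)),
        "SER": error_rate(ref.split(), hyp.split()),
        "LER": error_rate(ref.splitlines(), hyp.splitlines()),
    }


# Hypothetical two-spine excerpt; the prediction gets one pitch wrong.
reference = "**kern\t**kern\n4c\t4e\n4d\t4f\n"
prediction = "**kern\t**kern\n4c\t4e\n4d\t4g\n"
rates = kern_error_rates(reference, prediction)
```

In this example a single wrong pitch changes 1 of 26 characters, 1 of 6 symbols and 1 of 3 lines, which is why LER is typically the largest of the three figures, as in the Figure 5 sample.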