Sheet Music Transformer: End-To-End Optical Music Recognition Beyond Monophonic Transcription
Antonio Ríos-Vila, Jorge Calvo-Zaragoza, Thierry Paquet
TL;DR
The paper addresses end-to-end optical music recognition (OMR) for polyphonic scores by introducing the Sheet Music Transformer (SMT), an autoregressive image-to-sequence model with a CNN-based encoder and a Transformer decoder that transcribes sheet music into the Humdrum kern encoding. SMT preserves multi-voice structure through a two-dimensional positional encoding and a 2D-aware feature representation, avoiding the vertical collapse that earlier monophony-oriented methods rely on. Evaluated on the GrandStaff and Quartets datasets, SMT outperforms state-of-the-art approaches: the SMT_NexT backbone achieves substantial improvements in character, symbol, and line error rates (CER, SER, LER) and yields transcriptions that musicological tools can render more reliably. The work moves end-to-end OMR toward robust polyphonic transcription and points to future work on segmentation-free full-page transcription and universal OMR, with practical benefits for music analysis and digitization workflows.
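To make the three reported metrics concrete, the following is a minimal sketch of how CER, SER, and LER could be computed. It assumes, as is common in OMR evaluation, that each metric is an edit distance between hypothesis and reference, normalized by the reference length and expressed as a percentage. The helper names (`edit_distance`, `error_rate`, `as_chars`, `as_symbols`, `as_lines`) and the whitespace-based symbol tokenization are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Total edits over total reference length, as a percentage."""
    total_len = sum(len(r) for r in refs)
    if total_len == 0:
        raise ValueError("reference sequences are empty")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * total_edits / total_len


# Three granularities of the same kern-like transcription.
def as_chars(text: str) -> List[str]:
    return list(text)            # CER: every character is a token


def as_symbols(text: str) -> List[str]:
    return text.split()          # SER: assumption, whitespace-separated symbols


def as_lines(text: str) -> List[str]:
    return text.split("\n")      # LER: every text line is a token
```

For example, with the reference `"4c\t4e\n4d\t4f"` and the hypothesis `"4c\t4e\n4d\t4g"` (one wrong pitch), this sketch gives an SER of 25% (one of four symbols) and an LER of 50% (one of two lines), which illustrates why LER is the strictest of the three.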
Abstract
State-of-the-art end-to-end Optical Music Recognition (OMR) has, to date, primarily relied on monophonic transcription techniques, handling complex score layouts such as polyphony through simplifications or specific adaptations. Although effective, these approaches face scalability challenges and other limitations. This paper presents the Sheet Music Transformer, the first end-to-end OMR model designed to transcribe complex musical scores without relying solely on monophonic strategies. The model employs a Transformer-based image-to-sequence framework that predicts score transcriptions in a standard digital music encoding format from input images. It has been tested on two polyphonic music datasets and has proven capable of handling these intricate music structures effectively. The experimental results show not only that the model is competent but also that it outperforms state-of-the-art methods, thus advancing end-to-end OMR transcription.
