Practical End-to-End Optical Music Recognition for Pianoform Music

Jiří Mayer, Milan Straka, Jan Hajič, Pavel Pecina

TL;DR

This work defines a sequential format called Linearized MusicXML, which allows an end-to-end model to be trained directly while staying closely compatible with the industry-standard MusicXML format. The resulting model is tested against the recently published synthetic pianoform dataset GrandStaff and surpasses the state-of-the-art results.

Abstract

The majority of recent progress in Optical Music Recognition (OMR) has been achieved with Deep Learning methods, especially models following the end-to-end paradigm, which read input images and produce a linear sequence of tokens. Unfortunately, many music scores, especially piano music, cannot be easily converted to a linear sequence. This has led OMR researchers to use custom linearized encodings instead of broadly accepted structured formats for music notation. The diversity of these encodings makes it difficult to compare the performance of OMR systems directly. To bring recent OMR model progress closer to useful results: (a) We define a sequential format called Linearized MusicXML, which allows an end-to-end model to be trained directly while maintaining close cohesion and compatibility with the industry-standard MusicXML format. (b) We create a dev and a test set for benchmarking typeset OMR with MusicXML ground truth, based on the OpenScore Lieder corpus. The two sets contain 1,438 and 1,493 pianoform systems, respectively, each with an image from IMSLP. (c) We train and fine-tune an end-to-end model to serve as a baseline on the dataset and employ the TEDn metric to evaluate the model. We also test our model against the recently published synthetic pianoform dataset GrandStaff and surpass the state-of-the-art results.
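
To make the idea of linearizing a structured format concrete, the sketch below flattens a small MusicXML-style fragment into a token sequence with a depth-first walk. This is an illustration only: the token scheme (one opening token, an optional text token, and one closing token per element) and the names `linearize` and `NOTE_XML` are assumptions made here, not the Linearized MusicXML vocabulary defined in the paper.

```python
import xml.etree.ElementTree as ET


def linearize(element):
    """Flatten an XML element into a flat list of string tokens.

    Hypothetical scheme for illustration: each element yields an opening
    token, its stripped text (if any), its children's tokens in document
    order, and a closing token. The paper's actual format may differ.
    """
    tokens = [f"<{element.tag}>"]
    text = (element.text or "").strip()
    if text:
        tokens.append(text)
    for child in element:
        tokens.extend(linearize(child))
    tokens.append(f"</{element.tag}>")
    return tokens


# A tiny MusicXML-style note: pitch C4 written as a quarter note.
NOTE_XML = """
<note>
  <pitch><step>C</step><octave>4</octave></pitch>
  <duration>1</duration>
  <type>quarter</type>
</note>
"""

tokens = linearize(ET.fromstring(NOTE_XML.strip()))
print(" ".join(tokens))
```

A sequence model can be trained to emit such a token list directly, and because the walk preserves nesting and order, the tree can be rebuilt from the tokens. A naive scheme like this one keeps every tag, so it would not achieve the kind of compression a purpose-built format aims for.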

Paper Structure (7 sections, 6 figures, 3 tables)

This paper contains 7 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Typology of music notation by complexity. (a) Monophonic scores are straightforwardly encoded as sequences; for (b) homophonic scores (chords allowed, but all simultaneous notes have the same length), advance coding has been used with promising results, still with the CTC objective [AlfaroContreras2023]. For (c) polyphony, linearization becomes necessary, and (d) pianoform music adds interaction between staffs within one grand staff and generally contains the greatest density of objects.
  • Figure 2: One measure -- 246 lines of MusicXML represented by only 96 tokens of Linearized MusicXML (formatting and indentation are present only for better readability).
  • Figure 3: Comparison of a synthetic and scanned sample. Notice the different bass clef style and measure width.
  • Figure 4: Architecture of our model.
  • Figure 5: An exemplary part of a system and its four random augmentations.
  • ...and 1 more figures