Toward a More Complete OMR Solution

Guang Yang, Muru Zhang, Lin Qiu, Yanming Wan, Noah A. Smith

TL;DR

This study introduces a YOLOv8-based music object detector that improves detection performance, together with a supervised training pipeline that performs notation assembly on detection output. The resulting model outperforms existing models trained on perfect detection output.

Abstract

Optical music recognition (OMR) aims to convert music notation into digital formats. One approach to tackle OMR is through a multi-stage pipeline, where the system first detects visual music notation elements in the image (object detection) and then assembles them into a music notation (notation assembly). Most previous work on notation assembly unrealistically assumes perfect object detection. In this study, we focus on the MUSCIMA++ v2.0 dataset, which represents musical notation as a graph with pairwise relationships among detected music objects, and we consider both stages together. First, we introduce a music object detector based on YOLOv8, which improves detection performance. Second, we introduce a supervised training pipeline that completes the notation assembly stage based on detection output. We find that this model is able to outperform existing models trained on perfect detection output, showing the benefit of considering the detection and assembly stages in a more holistic way. These findings, together with our novel evaluation metric, are important steps toward a more complete OMR solution.

Paper Structure (20 sections, 4 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: An overview of our OMR pipeline, highlighting key components: object detection, notation assembly, and evaluation metric. Detailed explanations of each component can be found in the subsections on object detection, notation assembly, and end-to-end evaluation, respectively.
  • Figure 2: Example of a music image (binarized) extracted from the MUSCIMA++ dataset.
  • Figure 3: Frequencies of different classes in the dataset, from most to least frequent. The distribution is long-tailed, and the 48 classes to the right of the red line never appear. The $y$-axis shows the value of $\ln(\text{frequency}+1)$. The top-5 classes are stem, noteheadFull, ledgerLine, beam, and staffSpace.
  • Figure 4: An example of detected objects and predicted graph, alongside ground truth. On the right is the constructed bipartite graph (zero-weight edges not shown). Thick edges represent the matching function $\mathcal{M}$ induced by the matching algorithm. In our notation, $E = \{(v_2, v_1), (v_3, v_1), (v_4, v_1)\}$ and the matching function maps $v_1$ to $\tilde{v}_1$, $v_2$ to $\tilde{v}_2$, and $v_4$ to $\tilde{v}_4$. Therefore, $\hat{E} = \{(\tilde{v}_2, \tilde{v}_1), (\tilde{v}_4, \tilde{v}_1)\}$. Because $\tilde{E} = \{(\tilde{v}_2, \tilde{v}_1), (\tilde{v}_2, \tilde{v}_4), (\tilde{v}_3, \tilde{v}_1), (\tilde{v}_4, \tilde{v}_1)\}$, we get a precision of 0.5 and recall of 1.0.
  • Figure 5: Example of music symbol detection segments for inference. The thick red line indicates the primary cropped area, while the thick blue line represents an extended cropped section designed to include partial symbols that may extend beyond the main cropped area. For better visualization, we only show the extended area of one image crop. Image crops on the right and bottom border of the page are padded to fit into YOLOv8.
  • ...and 1 more figure
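
The arithmetic in the Figure 4 caption can be reproduced with a short sketch. The caption does not spell out the formulas, so the code assumes precision $= |\hat{E} \cap \tilde{E}| / |\tilde{E}|$ and recall $= |\hat{E} \cap \tilde{E}| / |\hat{E}|$, which is the reading that yields the reported 0.5 and 1.0. Treating $E$ as the ground-truth edge set and $\tilde{E}$ as the predicted edge set is also an assumption, and the function name `edge_precision_recall` and the string node labels are hypothetical, chosen only for illustration.

```python
def edge_precision_recall(gt_edges, matching, pred_edges):
    """Map ground-truth edges through a node matching, then score predicted edges.

    gt_edges:   set of (src, dst) pairs over ground-truth nodes (E in the caption)
    matching:   dict from ground-truth node to its matched predicted node (M)
    pred_edges: set of (src, dst) pairs over predicted nodes (E-tilde)
    """
    # E-hat: keep only edges whose two endpoints are both matched.
    mapped_edges = {
        (matching[u], matching[v])
        for (u, v) in gt_edges
        if u in matching and v in matching
    }
    true_positives = len(mapped_edges & pred_edges)
    # Assumed definitions (not stated in the caption): see the lead-in.
    precision = true_positives / len(pred_edges) if pred_edges else 0.0
    recall = true_positives / len(mapped_edges) if mapped_edges else 0.0
    return mapped_edges, precision, recall


# Values copied from the Figure 4 caption; "t1" stands for v-tilde-1, and so on.
E = {("v2", "v1"), ("v3", "v1"), ("v4", "v1")}
M = {"v1": "t1", "v2": "t2", "v4": "t4"}  # v3 has no match
E_tilde = {("t2", "t1"), ("t2", "t4"), ("t3", "t1"), ("t4", "t1")}

E_hat, precision, recall = edge_precision_recall(E, M, E_tilde)
print(sorted(E_hat), precision, recall)
```

Under these assumptions, the edge $(v_3, v_1)$ is dropped because $v_3$ is unmatched, so it does not lower recall; both remaining mapped edges appear among the four edges of $\tilde{E}$, giving 2/4 precision and 2/2 recall.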