
Cluster and Separate: a GNN Approach to Voice and Staff Prediction for Score Engraving

Francesco Foscarin, Emmanouil Karystinaios, Eita Nakamura, Gerhard Widmer

TL;DR

The paper tackles voice and staff separation for piano score engraving from symbolic music by modeling notes as a graph and learning edge-based relationships with a graph neural network. It reframes voice separation as edge prediction: chord edges cluster notes into chords and voice edges connect notes within a voice, which allows an unbounded number of voices and handles cross-staff voices. A postprocessing stage then enforces engraving rules. The method outperforms baselines on two diverse piano datasets and provides an MEI-exportable workflow with visualization, including cross-staff voice demonstrations. This work advances practical engraving from symbolic inputs and offers an end-to-end pipeline for producing readable scores with cross-staff voicing, beam planning, and rest insertion.

Abstract

This paper approaches the problem of separating the notes from a quantized symbolic music piece (e.g., a MIDI file) into multiple voices and staves. This is a fundamental part of the larger task of music score engraving (or score typesetting), which aims to produce readable musical scores for human performers. We focus on piano music and support homophonic voices, i.e., voices that can contain chords, and cross-staff voices, which are notably difficult tasks that have often been overlooked in previous research. We propose an end-to-end system based on graph neural networks that clusters notes that belong to the same chord and connects them with edges if they are part of a voice. Our results show clear and consistent improvements over a previous approach on two datasets of different styles. To aid the qualitative analysis of our results, we support export to symbolic music formats and provide a direct visualization of our output graph over the musical score. All code and pre-trained models are available at https://github.com/CPJKU/piano_svsep
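
To make the "cluster and separate" idea concrete, here is a minimal sketch of how chords and voices could be read off an output graph once edges are known. It is not the authors' implementation: the note tuples, the `chord_edges` and `voice_edges` lists, and the `connected_components` helper are all hypothetical, and the edges are written by hand as "hard" decisions, whereas the paper's model predicts probabilities that are then postprocessed.

```python
from collections import defaultdict

def connected_components(n, edges):
    """Group node indices 0..n-1 into connected components (union-find)."""
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller index as root so the result is deterministic.
            parent[max(ra, rb)] = min(ra, rb)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())

# Hypothetical toy input: (onset, duration, MIDI pitch) per note.
notes = [
    (0, 1, 60),  # 0: C4, lower note of a chord
    (0, 1, 64),  # 1: E4, upper note of the same chord
    (1, 1, 65),  # 2: F4, continues the upper voice
    (0, 2, 36),  # 3: C2, bass voice
    (2, 2, 43),  # 4: G2, bass voice
]

# Hypothetical hand-written edges standing in for model output.
chord_edges = [(0, 1)]          # notes 0 and 1 sound as one chord
voice_edges = [(0, 2), (3, 4)]  # consecutive notes in the same voice

# Chords are the clusters induced by chord edges alone.
chords = connected_components(len(notes), chord_edges)
# Homophonic voices follow from chord and voice edges taken together.
voices = connected_components(len(notes), chord_edges + voice_edges)

print(chords)
print(voices)
```

Because voices are recovered as graph components rather than as labels from a fixed set, this formulation places no upper bound on the number of voices, which is the property the TL;DR highlights.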

Paper Structure (16 sections, 2 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Comparing different voice/staff assignments for two bars from C. Debussy's Estampes - Pagodes. (top) original; voices can be inferred from the beam grouping (horizontal lines connecting notes), rests, and stem sharing, and are colored for clarity. (bottom) hard-to-read rendition with voice and staff assigned according to heuristics we propose as a baseline.
  • Figure 2: Our Architecture. For simplification, we display the output graph as having "hard" voice predictions, while these are probabilities over voice candidates.
  • Figure 3: Output graph postprocessing. We do not display the predicted staff labels.
  • Figure 4: Comparison of voice and staff assignment between the original score (Ground Truth) and our method (GNN) on the first bars of C. Debussy's Estampes-Pagodes. Voice edges are drawn in red and chord edges in blue.
  • Figure :