End-to-end Piano Performance-MIDI to Score Conversion with Transformers

Tim Beyer, Angela Dai

TL;DR

This work presents an end-to-end deep learning approach that constructs detailed musical scores directly from real-world piano performance-MIDI files and is the first to directly predict notational details like trill marks or stem direction from performance data.

Abstract

The automated creation of accurate musical notation from an expressive human performance is a fundamental task in computational musicology. To this end, we present an end-to-end deep learning approach that constructs detailed musical scores directly from real-world piano performance-MIDI files. We introduce a modern transformer-based architecture with a novel tokenized representation for symbolic music data. Framing the task as sequence-to-sequence translation rather than note-wise classification reduces alignment requirements and annotation costs, while allowing the prediction of more concise and accurate notation. To serialize symbolic music data, we design a custom tokenization stage based on compound tokens that carefully quantizes continuous values. This technique preserves more score information while reducing sequence lengths by $3.5\times$ compared to prior approaches. Using the transformer backbone, our method demonstrates better understanding of note values, rhythmic structure, and details such as staff assignment. When evaluated end-to-end using transcription metrics such as MUSTER, we achieve significant improvements over previous deep learning approaches and complex HMM-based state-of-the-art pipelines. Our method is also the first to directly predict notational details like trill marks or stem direction from performance data. Code and models are available at https://github.com/TimFelixBeyer/MIDI2ScoreTransformer
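
The compound-token idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' tokenizer: the field names, the 10 ms time step, the clipping limit, and the velocity bucket size are all hypothetical values chosen for illustration. The sketch bundles each note's attributes into a single compound token after quantizing the continuous values, and contrasts that with a flat serialization that spends one token per attribute.

```python
from typing import NamedTuple


class Note(NamedTuple):
    pitch: int        # MIDI pitch, 0-127
    onset: float      # seconds
    duration: float   # seconds
    velocity: int     # MIDI velocity, 0-127


class CompoundToken(NamedTuple):
    pitch: int
    onset_delta_bin: int
    duration_bin: int
    velocity_bin: int


# All constants below are assumptions for this sketch, not the paper's settings.
TIME_STEP = 0.01       # seconds per time bin
MAX_TIME_BIN = 199     # larger values are clipped to this bin
VELOCITY_BUCKET = 16   # 128 velocities -> 8 coarse buckets


def quantize(value: float, step: float, max_bin: int) -> int:
    """Map a continuous value to the nearest bin index, clipped to [0, max_bin]."""
    return max(0, min(max_bin, round(value / step)))


def to_compound(notes: list[Note]) -> list[CompoundToken]:
    """Emit one compound token per note, with onsets encoded relative to the previous note."""
    tokens = []
    prev_onset = 0.0
    for note in sorted(notes, key=lambda n: (n.onset, n.pitch)):
        tokens.append(CompoundToken(
            pitch=note.pitch,
            onset_delta_bin=quantize(note.onset - prev_onset, TIME_STEP, MAX_TIME_BIN),
            duration_bin=quantize(note.duration, TIME_STEP, MAX_TIME_BIN),
            velocity_bin=note.velocity // VELOCITY_BUCKET,
        ))
        prev_onset = note.onset
    return tokens


def to_flat(tokens: list[CompoundToken]) -> list[str]:
    """Flat alternative for comparison: one token per attribute, so four per note here."""
    return [f"{field}_{value}" for tok in tokens for field, value in tok._asdict().items()]
```

In this toy setup the compound sequence is four times shorter than the flat one simply because each note carries four attributes. That ratio is a property of the sketch and should not be read as the $3.5\times$ reduction reported in the abstract, which is measured against prior tokenization schemes.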

Paper Structure (18 sections, 6 equations, 2 figures, 7 tables)

Figures (2)

  • Figure 1: Model architecture overview. We use a standard RoFormer encoder-decoder model (Su et al., 2021) with custom token embedding and projection layers. Each token stream is embedded separately, and the per-stream embeddings are then summed into a single shared embedding of constant size. The backbone architecture is unchanged from models applied to NLP or other sequence-to-sequence learning tasks. In this illustration, depth symbolizes the time direction.
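
The embedding step described in the Figure 1 caption (embed each token stream separately, then sum into one shared embedding) might look roughly like the NumPy sketch below. The stream names, vocabulary sizes, and embedding width are assumptions made for illustration, and the randomly initialized tables stand in for learned parameters; the actual model is a RoFormer encoder-decoder whose details are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-stream vocabulary sizes and embedding width (not the paper's values).
STREAM_VOCAB = {"pitch": 128, "onset": 200, "duration": 200, "velocity": 8}
D_MODEL = 16

# One embedding table per token stream; every table shares the same width
# so that the per-stream embeddings can be added together.
tables = {
    name: rng.normal(size=(vocab, D_MODEL))
    for name, vocab in STREAM_VOCAB.items()
}


def embed(streams: dict[str, np.ndarray]) -> np.ndarray:
    """Look up each stream in its own table and sum the results.

    `streams` maps a stream name to an integer array of shape (seq_len,).
    The result has shape (seq_len, D_MODEL) no matter how many streams are summed.
    """
    return sum(tables[name][ids] for name, ids in streams.items())


# Three notes, each described by one index per stream.
streams = {
    "pitch": np.array([60, 64, 67]),
    "onset": np.array([0, 51, 0]),
    "duration": np.array([50, 25, 25]),
    "velocity": np.array([5, 4, 4]),
}
embedded = embed(streams)
```

Because the streams are summed rather than concatenated, the width of the vector handed to the backbone stays fixed when a stream is added or removed, which is consistent with the caption's remark that the backbone architecture itself is unchanged.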