Peransformer: Improving Low-informed Expressive Performance Rendering with Score-aware Discriminator

Xian He, Wei Zeng, Ye Wang

TL;DR

Peransformer addresses the practicality gap in expressive performance rendering (EPR) by combining a Transformer-based low-informed model with a score-aware discriminator, trained on a score-to-performance aligned MIDI dataset (ASAP-MIDI). It introduces Generalized EPR Metrics (GEM) to standardize evaluation and enable direct comparisons across EPR systems. Empirical results show state-of-the-art performance among low-informed models, with velocity and dynamics predictions approaching those of highly-informed systems, and subjective listening tests corroborate the improvements. The work provides a unified evaluation workflow and dataset to support reliable cross-model comparisons in MIDI-based EPR research.
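GEM itself is not defined in this excerpt. As a hypothetical illustration of the kind of note-level comparison that a score-to-performance aligned dataset makes possible, the sketch below assumes that predicted and human performance notes have already been matched through their shared score notes. It reports two illustrative statistics on MIDI velocity: mean absolute error and Pearson correlation. The function name `velocity_agreement` and the choice of statistics are assumptions, not the paper's metric definitions.

```python
from math import sqrt
from statistics import mean


def velocity_agreement(predicted, reference):
    """Compare MIDI velocities (0-127) of note-aligned predicted and reference notes.

    Hypothetical illustration: assumes predicted[i] and reference[i] belong to the
    same score note, which is what a note-to-note aligned dataset provides.
    Returns (mean absolute error, Pearson correlation).
    """
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 notes")
    # Average per-note velocity error, in MIDI velocity steps
    mae = mean(abs(p - r) for p, r in zip(predicted, reference))
    # Pearson correlation captures whether the dynamic contour has the same shape
    mp, mr = mean(predicted), mean(reference)
    cov = sum((p - mp) * (r - mr) for p, r in zip(predicted, reference))
    sp = sqrt(sum((p - mp) ** 2 for p in predicted))
    sr = sqrt(sum((r - mr) ** 2 for r in reference))
    # Correlation is undefined when either sequence is constant
    corr = cov / (sp * sr) if sp > 0 and sr > 0 else float("nan")
    return mae, corr
```

For example, `velocity_agreement([60, 70, 80], [62, 71, 79])` yields a mean absolute error of 4/3 velocity steps and a correlation close to 1. Reporting both separates an overall loudness offset (error) from a mismatched dynamic contour (correlation).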

Abstract

Highly-informed Expressive Performance Rendering (EPR) systems transform music scores with rich musical annotations into human-like expressive performance MIDI files. While these systems have achieved promising results, detailed music scores are less widely available than MIDI files and are less flexible to work with in a digital audio workstation (DAW). Recent low-informed EPR systems offer a more accessible alternative by taking score-derived MIDI directly as input, but these systems often exhibit suboptimal performance. Meanwhile, existing works are evaluated with diverse automatic metrics and data formats, which hinders direct objective comparisons between EPR systems. In this study, we introduce Peransformer, a Transformer-based low-informed EPR system designed to bridge the gap between low-informed and highly-informed EPR systems. Our approach incorporates a score-aware discriminator that leverages the underlying score-derived MIDI files and is trained on a score-to-performance paired, note-to-note aligned MIDI dataset. Experimental results, corroborated by subjective evaluations, show that Peransformer achieves state-of-the-art performance among low-informed systems. Furthermore, we extend existing automatic evaluation metrics for EPR systems and introduce Generalized EPR Metrics (GEM), enabling more direct, accurate, and reliable comparisons across EPR systems.
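To make the idea of a score-aware discriminator concrete, here is a minimal conditional-discriminator sketch in plain Python. It assumes, hypothetically, that the discriminator sees each performance note together with its aligned score-derived MIDI note, so it judges a performance relative to the score rather than in isolation. The per-note features, the untrained linear scorer, and the standard binary cross-entropy GAN losses are all stand-ins chosen for illustration. The paper's actual discriminator (Figure 1) is a learned network whose architecture and training objective are not reproduced here.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Note:
    onset: float     # seconds
    duration: float  # seconds
    pitch: int       # MIDI pitch, 0-127
    velocity: int    # MIDI velocity, 0-127


def pair_features(score: Note, perf: Note) -> List[float]:
    """Hypothetical features describing a performance note relative to its score note."""
    return [
        perf.onset - score.onset,                   # timing deviation (s)
        perf.duration / max(score.duration, 1e-6),  # articulation ratio
        perf.velocity / 127.0,                      # performed dynamics
        score.pitch / 127.0,                        # score context
    ]


def _sigmoid(z: float) -> float:
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score_aware_discriminator(score_notes: Sequence[Note], perf_notes: Sequence[Note],
                              weights: Sequence[float], bias: float) -> float:
    """Toy linear discriminator over aligned (score, performance) pairs; returns P(human)."""
    if len(score_notes) != len(perf_notes) or not score_notes:
        raise ValueError("expects non-empty, note-to-note aligned sequences")
    logits = [sum(w * f for w, f in zip(weights, pair_features(s, p))) + bias
              for s, p in zip(score_notes, perf_notes)]
    # Average the per-note logits into one decision for the whole excerpt
    return _sigmoid(sum(logits) / len(logits))


def discriminator_loss(p_human: float, p_generated: float, eps: float = 1e-12) -> float:
    # Binary cross-entropy: push D(score, human) toward 1 and D(score, generated) toward 0
    return -(math.log(p_human + eps) + math.log(1.0 - p_generated + eps))


def generator_loss(p_generated: float, eps: float = 1e-12) -> float:
    # Non-saturating generator loss: push D(score, generated) toward 1
    return -math.log(p_generated + eps)
```

In this sketch the same score notes accompany both the human and the generated performance, so the discriminator can only tell them apart through how each deviates from the score. Learning the weights (for example by gradient descent on `discriminator_loss`) is omitted.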

Paper Structure

This paper contains 17 sections, 4 equations, 3 figures, 4 tables, and 1 algorithm.

Figures (3)

  • Figure 1: The performance model and the discriminator.
  • Figure 2: Results of the subjective evaluation. The bar charts show the Mean Opinion Score (MOS) of the models. The error bars show the 95% Confidence Interval (CI). The statistical significance of Welch's t-tests between Peransformer and the corresponding model is indicated above the error bars. "ns", "*", "**", and "***" denote $p \geq 0.05$, $p < 0.05$, $p < 0.01$, and $p < 0.001$ respectively.
  • Figure 3: Velocity changes of the first 60 notes of Schubert's Fantasie in C major, Op. 15.
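The significance markers in Figure 2 come from Welch's t-test, which compares two mean ratings without assuming the two systems' ratings have equal variance. The sketch below is a standard-library implementation of that test. The two-sided p-value is computed from the regularized incomplete beta function using the classic continued-fraction method. It also includes the caption's mapping from p-values to "ns" and star markers. The helper names are hypothetical, and the paper's own analysis code is not shown in this excerpt.

```python
import math
from statistics import mean, variance

_TINY = 1e-300


def _lentz_step(aa, c, d):
    # One modified-Lentz update, guarding against division by zero
    d = 1.0 + aa * d
    d = 1.0 / (d if abs(d) > _TINY else _TINY)
    c = 1.0 + aa / c
    c = c if abs(c) > _TINY else _TINY
    return c, d


def _betacf(a, b, x, max_iter=300, eps=3e-14):
    """Continued fraction for the incomplete beta function."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > _TINY else _TINY)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        c, d = _lentz_step(m * (b - m) * x / ((qam + m2) * (a + m2)), c, d)  # even term
        h *= d * c
        c, d = _lentz_step(-(a + m) * (qab + m) * x / ((a + m2) * (qap + m2)), c, d)  # odd term
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def betainc_reg(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_bt = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
              + a * math.log(x) + b * math.log1p(-x))
    bt = math.exp(log_bt)
    # Use the continued fraction where it converges fastest
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b


def welch_t_test(sample_a, sample_b):
    """Two-sided Welch's t-test. Returns (t, Welch-Satterthwaite df, p-value)."""
    na, nb = len(sample_a), len(sample_b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two ratings")
    va, vb = variance(sample_a) / na, variance(sample_b) / nb  # squared standard errors
    if va + vb == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    # Two-sided p-value of Student's t: P(|T| >= |t|) = I_{df/(df+t^2)}(df/2, 1/2)
    p = betainc_reg(df / 2.0, 0.5, df / (df + t * t))
    return t, df, p


def significance_label(p):
    """Map a p-value to the markers used in Figure 2."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "ns"
```

For instance, `welch_t_test([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])` gives t = -1.0 with 8.0 degrees of freedom, and the resulting p-value is well above 0.05, so `significance_label` returns "ns". Where SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test.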