Towards An Integrated Approach for Expressive Piano Performance Synthesis from Music Scores

Jingjing Tang, Erica Cooper, Xin Wang, Junichi Yamagishi, George Fazekas

TL;DR

This work tackles the problem of turning symbolic piano scores into expressive audio with an integrated two-stage system: a Transformer-based MIDI-to-MIDI (M2M) module that renders expressive MIDI from score MIDI, and a fine-tuned MIDI-to-Audio (M2A) synthesiser that converts the expressive MIDI to audio. The authors align score and performance MIDI using Nakamura-style methods, tokenise MIDI at the feature level with an octuple scheme, and adapt the M2M and M2A models through architectural and training choices that include GradNorm and segment-based processing. Objective metrics and mean opinion score (MOS) listening tests on subsets of the ATEPP-1.2 dataset show that the system reconstructs velocity and inter-onset interval (IOI) effectively, captures acoustic ambience, and improves expressiveness over baselines, albeit with limitations in pedalling and some aspects of audio quality. The work demonstrates a practical path toward end-to-end expressive piano synthesis from scores; future work focuses on pedalling prediction, chromagram-aware losses, and generalisation to a broader range of acoustic environments.
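To make the IOI feature mentioned above concrete, the following is a minimal sketch of how inter-onset intervals could be derived from note onset times and compared between a score and a performance. The function names, the toy onset values, and the use of mean absolute error are illustrative assumptions; this summary does not state which error measure the paper uses.

```python
from typing import List


def inter_onset_intervals(onsets: List[float]) -> List[float]:
    """Return the time gaps (in seconds) between consecutive note onsets."""
    ordered = sorted(onsets)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def mean_absolute_error(predicted: List[float], target: List[float]) -> float:
    """Average absolute difference between two equal-length sequences.

    Assumption: MAE is used here only for illustration, not as the paper's metric.
    """
    if len(predicted) != len(target):
        raise ValueError("sequences must have the same length")
    if not predicted:
        return 0.0
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


# Hypothetical onset times: a metronomic score versus an expressive rendering.
score_onsets = [0.0, 0.5, 1.0, 1.5]
performed_onsets = [0.0, 0.625, 1.0, 1.75]

score_ioi = inter_onset_intervals(score_onsets)
performed_ioi = inter_onset_intervals(performed_onsets)
ioi_error = mean_absolute_error(performed_ioi, score_ioi)
```

In this toy case the score has a constant IOI, while the performed IOIs vary from note to note; that deviation is the kind of timing expressiveness an M2M model is expected to reproduce.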

Abstract

This paper presents an integrated system that transforms symbolic music scores into expressive piano performance audio. By combining a Transformer-based Expressive Performance Rendering (EPR) model with a fine-tuned neural MIDI synthesiser, our approach directly generates expressive audio performances from score inputs. To the best of our knowledge, this is the first system to offer a streamlined method for converting score MIDI files lacking expression control into rich, expressive piano performances. We conducted experiments using subsets of the ATEPP dataset, evaluating the system with both objective metrics and subjective listening tests. Our system not only accurately reconstructs human-like expressiveness, but also captures the acoustic ambience of environments such as concert halls and recording studios. Additionally, the proposed system demonstrates its ability to achieve musical expressiveness while ensuring good audio quality in its outputs.

Paper Structure (16 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Overview of the proposed score-to-audio system. The left part (M2M) illustrates our EPR model, featuring a Transformer encoder. The right section (M2A) details the architecture of the MIDI synthesiser, which incorporates a Transformer model li2019neural adapted from text-to-speech (TTS) tasks and a HiFi-GAN vocoder kong2020hifi.
  • Figure 2: Architecture of the M2M model, adapted from our previous work tang2023reconstructing. Modifications to the original design are highlighted in red font and enclosed in red boxes for clarity.
  • Figure 3: MOS for systems listed in Table \ref{tab:listening_test} from Test A. Scores with respect to the internal and external compositions are presented.