Towards An Integrated Approach for Expressive Piano Performance Synthesis from Music Scores
Jingjing Tang, Erica Cooper, Xin Wang, Junichi Yamagishi, George Fazekas
TL;DR
This work tackles the problem of turning symbolic piano scores into expressive audio with an integrated two-stage system: a Transformer-based MIDI-to-MIDI (M2M) module that renders expressive MIDI from score MIDI, and a fine-tuned MIDI-to-Audio (M2A) synthesiser that converts the expressive MIDI into audio. The authors align score and performance MIDI using Nakamura-style methods, tokenise MIDI at the feature level with an octuple scheme, and adapt the M2M and M2A models through architectural and training choices that include GradNorm and segment-based processing. Objective metrics and MOS-based listening tests on subsets of the ATEPP-1.2 dataset show that the system reconstructs velocity and inter-onset interval (IOI) effectively, captures acoustic ambience, and improves expressiveness over baselines, although pedalling and some aspects of audio quality remain limitations. The work demonstrates a practical path toward end-to-end expressive piano synthesis from scores. Future work focuses on pedalling prediction, chromagram-aware losses, and generalisation to a broader range of recording environments.
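As an illustration of the two note-level quantities named above, velocity and inter-onset interval (IOI), the following is a minimal sketch and not the authors' evaluation code. It assumes the rendered and reference performances have already been aligned note for note. The `Note` class, both helper functions, and the toy values are hypothetical and were chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Note:
    """Hypothetical note record; fields are assumptions for this sketch."""
    onset: float   # onset time in seconds
    pitch: int     # MIDI pitch, 0-127
    velocity: int  # MIDI velocity, 0-127


def inter_onset_intervals(notes: List[Note]) -> List[float]:
    """Return the time gaps between consecutive note onsets."""
    onsets = sorted(n.onset for n in notes)
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]


def mean_abs_error(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Mean absolute difference between two equally long sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must be aligned one-to-one")
    if not xs:
        return 0.0
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Toy data: three notes from a reference performance and a rendered one.
reference = [Note(0.0, 60, 60), Note(0.5, 64, 70), Note(1.0, 67, 80)]
rendered = [Note(0.0, 60, 64), Note(0.5, 64, 70), Note(1.25, 67, 76)]

# Timing error, measured on the gaps between onsets.
ioi_error = mean_abs_error(
    inter_onset_intervals(rendered), inter_onset_intervals(reference)
)

# Dynamics error, measured on the per-note MIDI velocities.
velocity_error = mean_abs_error(
    [n.velocity for n in rendered], [n.velocity for n in reference]
)

print(f"IOI error (s): {ioi_error:.3f}")
print(f"Velocity error: {velocity_error:.3f}")
```

Comparing intervals rather than absolute onset times keeps a constant timing offset between the two performances from counting as an error, which is one reason IOI is a common choice for describing expressive timing.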
Abstract
This paper presents an integrated system that transforms symbolic music scores into expressive piano performance audio. By combining a Transformer-based Expressive Performance Rendering (EPR) model with a fine-tuned neural MIDI synthesiser, our approach directly generates expressive audio performances from score inputs. To the best of our knowledge, this is the first system to offer a streamlined method for converting score MIDI files lacking expression control into rich, expressive piano performances. We conducted experiments using subsets of the ATEPP dataset, evaluating the system with both objective metrics and subjective listening tests. Our system not only accurately reconstructs human-like expressiveness, but also captures the acoustic ambience of environments such as concert halls and recording studios. Additionally, the proposed system demonstrates its ability to achieve musical expressiveness while ensuring good audio quality in its outputs.
