SyMuPe: Affective and Controllable Symbolic Music Performance
Ilya Borovik, Dmitrii Gavrilev, Vladimir Viro
TL;DR
SyMuPe introduces a flexible framework for expressive symbolic piano performance modeling and presents PianoFlow, a state-of-the-art model built on conditional flow matching with optimal-transport (OT) probability paths. Trained on a large, cleaned dataset of 2,968 hours of aligned score–MIDI data, and conditioned on a piano performance emotion classifier and emotion-weighted Flan-T5 text embeddings, PianoFlow supports unconditional generation, inpainting, and text/emotion-driven control with real-time inference. The approach is evaluated against transformer baselines and external models, showing superior objective metrics and listening-test quality comparable to that of human-recorded and transcribed MIDI samples. The framework and tokenizer, along with the multi-mask training regime, provide a reusable platform for interactive and accessible expressive performance systems. Remaining limitations include pedal modeling, trill handling, and dependence on the emotion and text encoders.
Abstract
Emotions are fundamental to the creation and perception of music performances. However, achieving human-like expression and emotion through machine learning models for performance rendering remains a challenging task. In this work, we present SyMuPe, a novel framework for developing and training affective and controllable symbolic piano performance models. Our flagship model, PianoFlow, uses conditional flow matching trained to solve diverse multi-mask performance inpainting tasks. By design, it supports both unconditional generation and infilling of music performance features. For training, we use a curated, cleaned dataset of 2,968 hours of aligned musical scores and expressive MIDI performances. For text and emotion control, we integrate a piano performance emotion classifier and tune PianoFlow with the emotion-weighted Flan-T5 text embeddings provided as conditional inputs. Objective and subjective evaluations against transformer-based baselines and existing models show that PianoFlow not only outperforms other approaches, but also achieves performance quality comparable to that of human-recorded and transcribed MIDI samples. For emotion control, we present and analyze samples generated under different text conditioning scenarios. The developed model can be integrated into interactive applications, contributing to the creation of more accessible and engaging music performance systems.
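To make the training objective mentioned above more concrete, the following is a minimal sketch of how one conditional flow matching training example with a linear optimal-transport path and an inpainting mask could be built. It is not the authors' implementation: the function names, the `sigma_min` value, the choice to zero out masked context, and restricting the loss to masked positions are all assumptions made for illustration.

```python
import numpy as np


def cfm_training_pair(x1, mask, rng, sigma_min=1e-4):
    """Build one flow matching training example (illustrative sketch).

    x1:   (notes, features) array of target performance features.
    mask: boolean array of the same shape, True where features must be generated.
    All names and defaults here are hypothetical, not taken from the paper.
    """
    x0 = rng.standard_normal(x1.shape)   # noise sample
    t = rng.uniform()                    # time in [0, 1)
    # Linear OT path between noise x0 and data x1.
    xt = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Velocity the model should regress to along this path.
    target = x1 - (1.0 - sigma_min) * x0
    # Assumption: unmasked features are passed to the model as clean context,
    # masked ones are zeroed out.
    context = np.where(mask, 0.0, x1)
    return xt, t, context, target


def masked_cfm_loss(pred, target, mask):
    """Mean squared error over masked (to-be-generated) positions only."""
    sq = (pred - target) ** 2
    return float((sq * mask).sum() / max(mask.sum(), 1))
```

In this sketch, a network would receive `xt`, `t`, `context`, and any text or emotion embedding, and its output would be compared to `target` with `masked_cfm_loss`. Setting the mask to all `True` corresponds to unconditional generation, while partial masks correspond to infilling.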
