Expressive MIDI-format Piano Performance Generation

Jingwei Liu

TL;DR

The paper addresses expressive piano performance generation using symbolic MIDI rather than raw audio, arguing that MIDI can encode rich expressive elements with potentially lower data requirements. It proposes listening-based data processing that abandons fixed time grids, adopts millisecond-level timing, and uses Weber's law-driven, perceptual quantization (Mel quantization) for note events, while explicitly modeling sustain pedal as a control channel. A Convolved Multi-argument LSTM with attention handles five interdependent input/output streams to generate polyphonic, expressive MIDI sequences, including precise onset timings, durations, velocities, and pedal states. The work is candid about its preliminary nature, noting limited training and the need for substantial fine-tuning, but demonstrates a framework aimed at achieving expressive symbolic music that could rival audio-based generation while maintaining interpretability and control. If extended and trained more extensively, this approach could offer a scalable, perceptually aligned pathway for high-fidelity symbolic music generation with practical musical reach.

Abstract

This work presents a generative neural network that produces expressive piano performances in MIDI format. Musical expressivity is reflected in vivid micro-timing, rich polyphonic texture, varied dynamics, and sustain pedal effects. The model is innovative in many respects, from data processing to neural network design. We claim that this symbolic music generation model overcomes the common criticisms of symbolic music and can generate expressive music flows as good as, if not better than, those generated from raw audio. One drawback is that, due to the limited time before submission, the model has not been fine-tuned or sufficiently trained, so the generated output may sound incoherent and random at certain points. Despite that, the model demonstrates a strong ability to generate expressive piano pieces.

Paper Structure (4 sections, 2 figures)

This paper contains 4 sections, 2 figures.

Figures (2)

  • Figure 1: The categorical distributions for given input features. The divisions obey Weber's law where the perceptual changes are proportional to the current values.
  • Figure 2: LSTM-Attention cell. A recurrent neural network designed for a multi-input-output generative system.
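
The Figure 1 caption says the category divisions obey Weber's law, with perceptual changes proportional to the current value. One way to read that is as bin edges that grow geometrically, so each bin is a fixed fraction wider than the one before it. The Python sketch below illustrates that reading only. The function names, the `weber_fraction` parameter, and the example range are hypothetical and are not taken from the paper, whose actual quantization scheme may differ.

```python
import bisect
import math


def weber_bin_edges(lo, hi, weber_fraction):
    """Return geometric bin edges starting at lo and reaching at least hi.

    Each edge is (1 + weber_fraction) times the previous one, so the bin
    width is always the same fraction of the bin's lower edge.
    All parameter names here are illustrative assumptions.
    """
    if lo <= 0 or hi <= lo or weber_fraction <= 0:
        raise ValueError("need 0 < lo < hi and weber_fraction > 0")
    ratio = 1.0 + weber_fraction
    n_bins = math.ceil(math.log(hi / lo) / math.log(ratio))
    return [lo * ratio ** k for k in range(n_bins + 1)]


def quantize(value, edges):
    """Map a value to the index of its bin, clamping out-of-range values."""
    idx = bisect.bisect_right(edges, value) - 1
    return min(max(idx, 0), len(edges) - 2)


# Hypothetical example: note durations between 10 ms and 1000 ms,
# with each category 10% wider than the previous one.
edges = weber_bin_edges(10.0, 1000.0, 0.1)
short_bin = quantize(11.5, edges)
long_bin = quantize(900.0, edges)
```

With such edges, short values are resolved finely and long values coarsely, which matches the idea that a fixed absolute change is easier to hear at small values than at large ones.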