Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, Douglas Eck

TL;DR

The paper tackles neural music generation by factorizing the problem into three stages: transcription of audio into discrete note events, probabilistic sequence modeling of those notes, and MIDI-conditioned waveform synthesis. Using notes as the intermediate representation is what allows coherent structure across timescales. It introduces MAESTRO, a large dataset of piano performances with finely aligned audio and MIDI, and a three-component pipeline: Onsets and Frames for transcription, Music Transformer for MIDI generation, and a WaveNet conditioned on MIDI for synthesis. The work achieves state-of-the-art piano transcription on MAESTRO and MAPS, and listening tests show that real and model-generated piano audio approach indistinguishability in certain conditions. This modular, data-rich approach lays groundwork for scalable, interpretable neural music models and for extension to other instruments.

Abstract

Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms to ~100 s), a process we call Wave2Midi2Wave. This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music.
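
As a quick check of the "six orders of magnitude" figure, the short sketch below takes the two approximate endpoints quoted in the abstract (~0.1 ms and ~100 s) and computes the base-10 logarithm of their ratio. The variable names are illustrative only and do not come from the paper.

```python
import math

# Endpoints quoted in the abstract (approximate values).
shortest_timescale_s = 0.1e-3  # ~0.1 ms, expressed in seconds
longest_timescale_s = 100.0    # ~100 s

# Orders of magnitude = base-10 logarithm of the ratio of the two timescales.
ratio = longest_timescale_s / shortest_timescale_s
orders_of_magnitude = math.log10(ratio)

print(f"ratio ~ {ratio:.0e}, orders of magnitude ~ {orders_of_magnitude:.1f}")
```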

Paper Structure

This paper contains 10 sections, 1 equation, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Wave2Midi2Wave system architecture for our suite of piano music models, consisting of (a) a conditional WaveNet model that generates audio from MIDI, (b) a Music Transformer language model that generates piano performance MIDI autoregressively, and (c) a piano transcription model that "encodes" piano performance audio as MIDI.
  • Figure 2: Results of our listening tests, showing the number of times each source won in a pairwise comparison. Black error bars indicate estimated standard deviation of means.