Generating Piano Music with Transformers: A Comparative Study of Scale, Data, and Metrics

Jonathan Lehmkuhl, Ábel Ilyés-Kun, Nico Bremes, Cemhan Kaan Özaltan, Frederik Muthers, Jiayi Yuan

TL;DR

This work examines how design choices in transformer-based symbolic piano music generation affect musical quality. It runs a systematic set of experiments that vary model size (62M–950M parameters), pre-training data (MAESTRO vs. Aria-MIDI), fine-tuning, and genre conditioning, using REMI tokenization and evaluating with multiple objective and subjective metrics plus a Turing-style listening test. The key findings are that larger models improve subjective quality but risk overfitting on small datasets, while pre-training on a large, diverse dataset and then fine-tuning on MAESTRO improves both subjective and objective performance. FMD, KLD, and OA align with human judgments better than perplexity (PPL), and genre conditioning offers style control, with strong outputs from the largest model. The results provide practical guidance on dataset scaling, transfer learning, and evaluation in symbolic music generation, and they call for robust benchmarks and high-level musical metrics to enable fair cross-study comparisons.

Abstract

Although a variety of transformers have been proposed for symbolic music generation in recent years, there has been little comprehensive study of how specific design choices affect the quality of the generated music. In this work, we systematically compare different datasets, model architectures, model sizes, and training strategies for the task of symbolic piano music generation. To support model development and evaluation, we examine a range of quantitative metrics and analyze how well they correlate with human judgments collected through listening studies. Our best-performing model, a 950M-parameter transformer trained on 80K MIDI files from diverse genres, produces outputs that are often rated as human-composed in a Turing-style listening survey.

Paper Structure

This paper contains 16 sections, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Training curves for the MAESTRO models of different sizes.
  • Figure 2: Training curves for the models pre-trained on Aria-Deduped compared to the 155M model trained only on MAESTRO.
  • Figure 3: Training curves for the models pre-trained on Aria-Deduped and fine-tuned on MAESTRO.
  • Figure 4: Training curves for the models with integrated genre information.
  • Figure 5: Fine-tuning curves for Moonbeam with context size 512.
  • ...and 7 more figures