SMART: Tuning a symbolic music generation system with an audio domain aesthetic reward

Nicolas Jonason, Luca Casini, Bob L. T. Sturm

TL;DR

This work tunes a symbolic piano MIDI generator using an audio-domain aesthetic reward. It introduces SMART, which finetunes a symbolic model by rendering its MIDI outputs to audio, scoring the audio with Meta Audiobox Aesthetics, and updating the model via Group Relative Policy Optimization with KL regularization toward a reference model. The results show higher predicted Content Enjoyment scores and measurable changes in MIDI features (more notes, more polyphony, wider pitch range), but also show that aggressive optimization reduces output diversity. A small listening study suggests higher perceived enjoyability. The findings highlight both the potential and the limitations of using audio-domain rewards to guide symbolic music generation, and point to future work on alternative rewards and multi-instrument settings.

Abstract

Recent work has proposed training machine learning models to predict aesthetic ratings for music audio. Our work explores whether such models can be used to finetune a symbolic music generation system with reinforcement learning, and what effect this has on the system outputs. To test this, we use group relative policy optimization to finetune a piano MIDI model with Meta Audiobox Aesthetics ratings of audio-rendered outputs as the reward. We find that this optimization has effects on multiple low-level features of the generated outputs, and improves the average subjective ratings in a preliminary listening study with $14$ participants. We also find that over-optimization dramatically reduces diversity of model outputs.

Paper Structure

This paper contains 18 sections, 2 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of the SMART training setup. First, the policy and reference models are initialized from the pretrained base model, and the reference model's weights are frozen. Then, at each iteration, a prompt is passed to the policy model, which generates multiple MIDI files. These are rendered to audio, and the aesthetic preference model assigns each rendering a reward. The rewards are used to compute group relative advantages, which in turn are used to update the policy model. An additional KL loss term regularizes the optimization by preventing the policy model from straying too far from the reference model. This figure is adapted from Shao et al. (2024), DeepSeekMath.
  • Figure 2: Predicted content enjoyment rating from MAA across SMART training iterations.
  • Figure 3: Optimizing for predicted content enjoyment also increases the other ratings. MAA ratings of 1000 generations with random procedural prompts for the piano model before and after audio reward optimization.
  • Figure 4: Optimizing towards the aesthetic reward affects the distribution of low-level features in model outputs. Histograms of various track and note features from 1000 generations with random procedural prompts for the piano model before and after SMART training.
  • Figure 5: On the left: Overall distribution of the ratings. Red indicates the base model, blue indicates the post-intervention model. On the right: Distribution of the ratings from each subject.
  • ...and 2 more figures
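
The update step described in the Figure 1 caption can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it omits the importance ratio and clipping of full group relative policy optimization, uses the per-token KL estimate from the DeepSeekMath formulation, and takes log-probabilities as plain floats, whereas a real training loop would use an autograd framework. The function names, the input layout, and the value of `beta` are all hypothetical.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of outputs sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_penalty(logp_policy, logp_ref):
    """Per-token KL estimate exp(d) - d - 1, with d = logp_ref - logp_policy.

    The estimate is non-negative and is zero when the two log-probabilities match.
    """
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1.0


def grpo_style_loss(group, beta=0.04):
    """Simplified loss for one group (beta is an illustrative value only).

    `group` is a list of dicts, one per generated sequence, with keys:
      'reward'      - scalar reward, e.g. a predicted aesthetic rating (assumed)
      'logp_policy' - per-token log-probabilities under the policy model
      'logp_ref'    - per-token log-probabilities under the frozen reference
    """
    advantages = group_relative_advantages([g["reward"] for g in group])
    total = 0.0
    for adv, g in zip(advantages, group):
        per_token = [
            -adv * lp + beta * kl_penalty(lp, lr)
            for lp, lr in zip(g["logp_policy"], g["logp_ref"])
        ]
        total += mean(per_token)
    return total / len(group)
```

Because each reward is standardized against the other samples for the same prompt, only the relative ranking within a group drives the update, and the KL term grows as the policy's token probabilities drift from the reference model's.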