SMART: Tuning a symbolic music generation system with an audio domain aesthetic reward
Nicolas Jonason, Luca Casini, Bob L. T. Sturm
TL;DR
This work addresses tuning a symbolic piano MIDI generator using an audio-domain aesthetic reward. It introduces SMART, which finetunes a symbolic model by rendering generated MIDI to audio, scoring the audio with Meta Audiobox Aesthetics, and updating the model via group relative policy optimization (GRPO) with KL regularization toward a reference model. The results show higher Content Enjoyment scores and measurable changes in MIDI features (more notes, more polyphony, wider pitch range), but also reveal that aggressive optimization harms output diversity; a small listening study indicates higher perceived enjoyability. The findings highlight the potential and limitations of using audio-domain rewards to guide symbolic music generation and point to future work on alternative rewards and multi-instrument settings.
Abstract
Recent work has proposed training machine learning models to predict aesthetic ratings for music audio. Our work explores whether such models can be used to finetune a symbolic music generation system with reinforcement learning, and what effect this has on the system outputs. To test this, we use group relative policy optimization to finetune a piano MIDI model with Meta Audiobox Aesthetics ratings of audio-rendered outputs as the reward. We find that this optimization affects multiple low-level features of the generated outputs, and improves the average subjective ratings in a preliminary listening study with 14 participants. We also find that over-optimization dramatically reduces the diversity of model outputs.
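To make the reward step concrete, the following is a minimal sketch of how group relative policy optimization commonly turns a group of per-sample rewards into advantages: each reward is standardized against the mean and standard deviation of its own group. This is the standard formulation, not necessarily the exact variant used in this work. The names `group_relative_advantages`, `score_group`, and `reward_fn` are hypothetical; `reward_fn` stands in for the whole "render MIDI to audio, then score with the aesthetics model" step, which is not implemented here.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples drawn for the same prompt.

    Assumption: mean/std normalization as in the common GRPO formulation;
    eps avoids division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def score_group(samples: Sequence[object], reward_fn: Callable[[object], float]) -> List[float]:
    """Score each generated sample and return its group-relative advantage.

    reward_fn is a hypothetical placeholder for rendering a MIDI sample to
    audio and rating the audio with an aesthetic predictor.
    """
    rewards = [reward_fn(s) for s in samples]
    return group_relative_advantages(rewards)


if __name__ == "__main__":
    # Toy stand-in reward: here the "samples" are just numbers scored as themselves.
    advantages = score_group([6.1, 7.4, 5.0, 7.0], reward_fn=float)
    print(advantages)
```

Samples rated above their group's average receive positive advantages and are reinforced, while those below are discouraged. In the full method, the policy update is additionally regularized with a KL term toward the reference model, which this sketch omits.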
