Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

Yusong Wu, Stephen Brade, Teng Ma, Tia-Jane Fowler, Enning Yang, Berker Banar, Aaron Courville, Natasha Jaques, Cheng-Zhi Anna Huang

TL;DR

This work addresses reward hacking in reinforcement learning post-training for real-time live music interaction, where diversity and adaptability are essential. It introduces Generative Adversarial Post-Training (GAPT), which jointly trains a chord-generation policy with a co-evolving discriminator that provides an adversarial reward $R_{adv}=-\log(1-D_{\psi}(y))$, alongside a coherence-based task reward, and uses a two-phase discriminator update to stabilize learning. The authors demonstrate that GAPT preserves harmonic coherence while restoring progression diversity across fixed melodies, model-model co-adaptation, and a real-time human-in-the-loop study with expert musicians, surpassing baselines in adaptation speed and perceived agency. This approach offers a practical, scalable mitigation for reward hacking in RL post-training of generative sequence models and has potential for extension to multi-agent co-creative settings and personalized preferences.
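
As a rough illustration of how the two reward terms might be combined, the sketch below computes an adversarial reward of the form $-\log(1-D(y))$ from a discriminator probability and adds it to a coherence reward. The function names, the weighting factor `adv_weight`, the clamping constant `eps`, and the convention that the discriminator outputs the probability that a progression comes from the data are assumptions made for illustration, not details taken from the paper.

```python
import math


def adversarial_reward(d_prob: float, eps: float = 1e-6) -> float:
    """Return -log(1 - D(y)).

    d_prob is assumed to be the discriminator's probability that the
    chord progression y comes from the data rather than the policy.
    """
    # Clamp so that a fully confident discriminator does not produce log(0).
    d_prob = min(max(d_prob, 0.0), 1.0 - eps)
    return -math.log(1.0 - d_prob)


def total_reward(coherence: float, d_prob: float, adv_weight: float = 1.0) -> float:
    """Coherence reward plus a weighted adversarial reward (weight is hypothetical)."""
    return coherence + adv_weight * adversarial_reward(d_prob)


if __name__ == "__main__":
    # A progression the discriminator finds more data-like earns a larger bonus.
    for p in (0.1, 0.5, 0.9):
        print(p, round(total_reward(coherence=1.0, d_prob=p), 4))
```

Because the adversarial term grows as the discriminator output approaches 1, a policy that repeats a few high-coherence chords but is easily identified as non-data receives little bonus, which is the intuition behind using this term against diversity collapse.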

Abstract

Most applications of generative AI involve a sequential interaction in which a person inputs a prompt and waits for a response, and where reaction time and adaptivity are not important factors. In contrast, live jamming is a collaborative interaction that requires real-time coordination and adaptation without access to the other player's future moves, while preserving diversity to sustain a creative flow. Reinforcement learning post-training enables effective adaptation through on-policy interaction, yet it often reduces output diversity by exploiting coherence-based rewards. This collapse, known as "reward hacking", affects many RL post-training pipelines, but is especially harmful in live jamming, where musical creativity relies on dynamic variation and mutual responsiveness. In this paper, we propose a novel adversarial training method on policy-generated trajectories to mitigate reward hacking in RL post-training for melody-to-chord accompaniment. A co-evolving discriminator separates policy trajectories from the data distribution, while the policy maximizes the discriminator output in addition to coherence rewards to prevent collapse to trivial outputs. We evaluate accompaniment quality and output diversity in simulation with both fixed test melodies and learned melody agents, and we conduct a user study with the model deployed in a real-time interactive system with expert musicians. Quantitative evaluation and user feedback demonstrate improved output diversity, harmonic coherence, adaptation speed and user agency. Our results demonstrate a simple yet effective method to mitigate reward hacking in RL post-training of generative sequence models.
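
The discriminator side of this setup can be sketched as a binary classification objective that separates data trajectories from policy trajectories. The snippet below uses the standard binary cross-entropy form of a GAN discriminator loss as an assumption; the paper's exact loss and its two-phase update schedule are not reproduced here, and the function name and `eps` constant are hypothetical.

```python
import math
from typing import Sequence


def discriminator_loss(
    d_on_data: Sequence[float],
    d_on_policy: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """Mean binary cross-entropy with data labeled 1 and policy rollouts labeled 0.

    Each input is a list of discriminator probabilities D(y) in [0, 1].
    """

    def clamp(p: float) -> float:
        # Keep probabilities away from 0 and 1 so the logs stay finite.
        return min(max(p, eps), 1.0 - eps)

    # Data trajectories should score high: penalize -log D(y).
    real_term = sum(-math.log(clamp(p)) for p in d_on_data) / len(d_on_data)
    # Policy trajectories should score low: penalize -log(1 - D(y)).
    fake_term = sum(-math.log(1.0 - clamp(p)) for p in d_on_policy) / len(d_on_policy)
    return real_term + fake_term


if __name__ == "__main__":
    undecided = discriminator_loss([0.5, 0.5], [0.5, 0.5])
    separating = discriminator_loss([0.9, 0.8], [0.1, 0.2])
    print(round(undecided, 4), round(separating, 4))
```

In this sketch the loss falls as the discriminator learns to tell the two sources apart, so the policy can keep its adversarial reward high only by producing progressions that resemble the data distribution.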

Paper Structure

This paper contains 46 sections, 6 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Left: RL post-training enables real-time adaptation for melody-to-chord accompaniment but is vulnerable to reward hacking: the policy exploits the coherence reward $R(x,y)$ by repeating simple, high-scoring chords, which reduces diversity and breaks creative flow. Right: We propose an adversarial reward signal to prevent reward hacking. A discriminator $D_{\psi}(y)$ is trained to distinguish policy rollouts from data, and its realism estimate is added to the reward. This regularizes the policy toward natural accompaniment while preserving input coherence, preventing diversity collapse.
  • Figure 2: Under the same melody input stream (first row) in a live accompaniment setting, the model trained without adversarial reward (second row) produces harmonically coherent yet unnatural progressions with repetitive, trivial, and low-coverage chord choices that hinder human-AI interaction. In contrast, GAPT (third row) produces coherent, natural, and diverse live chord accompaniment by jointly training the policy with a discriminator that supplies an adversarial reward.
  • Figure 3: Participant ratings for real-time jamming with each model. Error bars show standard error. GAPT has the highest mean on all three evaluation questions and significantly improves adaptation speed and perceived control and agency over ReaLchords ($p<0.05$). The improved user experience benefits from higher diversity under generative adversarial post-training.
  • Figure 4: GAPT advances the Pareto frontier for diversity versus harmony. In simulated interaction on the test set (a) and on an out-of-distribution dataset (b), GAPT attains higher diversity while preserving strong harmony. By contrast, Online MLE without RL produces diverse outputs but fails at harmonic coherence during interactive generation. ReaLchords and GAPT without adversarial training achieve strong harmony at the cost of diversity. The t-SNE visualization of test set generations (c) likewise shows that GAPT covers a broader region of the accompaniment space.
  • Figure 5: Harmony and diversity evaluated with a learned melody jamming agent (a) and in live user sessions (b). GAPT preserves harmonic coherence while restoring progression diversity compared to ReaLchords, yielding a better harmony and diversity tradeoff and higher perceived control and adaptation speed.
  • ...and 1 more figures