Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

Yusong Wu; Stephen Brade; Teng Ma; Tia-Jane Fowler; Enning Yang; Berker Banar; Aaron Courville; Natasha Jaques; Cheng-Zhi Anna Huang

TL;DR

This work addresses reward hacking in reinforcement learning post-training for real-time live music interaction, where diversity and adaptability are essential. It introduces Generative Adversarial Post-Training (GAPT), which jointly trains a chord-generation policy with a co-evolving discriminator that provides an adversarial reward $R_{adv}=-\log(1-D_{eta}(y))$, alongside a coherence-based task reward, and uses a two-phase discriminator update to stabilize learning. The authors demonstrate that GAPT preserves harmonic coherence while restoring progression diversity across fixed melodies, model-model co-adaptation, and a real-time human-in-the-loop study with expert musicians, surpassing baselines in adaptation speed and perceived agency. This approach offers a practical, scalable mitigation for reward hacking in RL post-training of generative sequence models and has potential for extension to multi-agent co-creative settings and personalized preferences.

Abstract

Most applications of generative AI involve a sequential interaction in which a person inputs a prompt and waits for a response, and where reaction time and adaptivity are not important factors. In contrast, live jamming is a collaborative interaction that requires real-time coordination and adaptation without access to the other player's future moves, while preserving diversity to sustain a creative flow. Reinforcement learning post-training enables effective adaptation through on-policy interaction, yet it often reduces output diversity by exploiting coherence-based rewards. This collapse, known as "reward hacking", affects many RL post-training pipelines, but is especially harmful in live jamming, where musical creativity relies on dynamic variation and mutual responsiveness. In this paper, we propose a novel adversarial training method on policy-generated trajectories to mitigate reward hacking in RL post-training for melody-to-chord accompaniment. A co-evolving discriminator separates policy trajectories from the data distribution, while the policy maximizes the discriminator output in addition to coherence rewards to prevent collapse to trivial outputs. We evaluate accompaniment quality and output diversity in simulation with both fixed test melodies and learned melody agents, and we conduct a user study with the model deployed in a real-time interactive system with expert musicians. Quantitative evaluation and user feedback demonstrate improved output diversity, harmonic coherence, adaptation speed and user agency. Our results demonstrate a simple yet effective method to mitigate reward hacking in RL post-training of generative sequence models.
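
As a rough illustration of the adversarial reward given in the TL;DR, the sketch below computes $-\log(1-D(y))$ and adds it to a coherence-based task reward in plain Python. It assumes that $D(y)$ is the discriminator's estimated probability that a trajectory $y$ came from the data distribution. The function names, the `adv_weight` mixing coefficient, the clamping constant `eps`, and the use of a standard binary cross-entropy loss for the discriminator are illustrative assumptions and are not taken from the paper. The two-phase discriminator update is not reproduced here.

```python
import math
from typing import Sequence


def adversarial_reward(d_prob: float, eps: float = 1e-8) -> float:
    """Return -log(1 - D(y)) for a discriminator output D(y) in [0, 1].

    The reward grows as the discriminator becomes more convinced that the
    policy trajectory looks like real data. `eps` (an assumption) keeps the
    logarithm finite when D(y) is exactly 1.
    """
    d = min(max(d_prob, 0.0), 1.0 - eps)
    return -math.log(1.0 - d)


def total_reward(coherence: float, d_prob: float, adv_weight: float = 1.0) -> float:
    """Combine a coherence-based task reward with the adversarial reward.

    `adv_weight` is a hypothetical mixing coefficient, not a value from the paper.
    """
    return coherence + adv_weight * adversarial_reward(d_prob)


def discriminator_loss(
    d_real: Sequence[float], d_fake: Sequence[float], eps: float = 1e-8
) -> float:
    """Standard binary cross-entropy GAN discriminator loss (assumed form).

    `d_real` holds D(y) for trajectories from the dataset and `d_fake` holds
    D(y) for trajectories sampled from the policy.
    """
    real_term = -sum(math.log(max(d, eps)) for d in d_real) / len(d_real)
    fake_term = -sum(math.log(max(1.0 - d, eps)) for d in d_fake) / len(d_fake)
    return real_term + fake_term


if __name__ == "__main__":
    # A trajectory the discriminator finds realistic earns a larger bonus.
    print(total_reward(coherence=1.0, d_prob=0.9))
    print(total_reward(coherence=1.0, d_prob=0.1))
```

Under these assumptions, a policy that collapses to a few repetitive progressions would tend to receive low $D(y)$ and therefore a small adversarial bonus, even when its coherence reward is high.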
