Reinforcement Learning Jazz Improvisation: When Music Meets Game Theory
Vedant Tapiavala, Joshua Piesner, Sourjyamoy Barman, Feng Fu
TL;DR
This work frames jazz improvisation as a two-player, payoff-based game and uses reinforcement learning to study strategic interactions under a chord-driven blues structure. The payoff combines a variance (diversity) component $V$ and a harmony component $H$ into $P=\frac{VM-H}{VM+H}$ with a balancing factor $M \approx 1208.7571$, enabling quantitative comparison across strategies. Key findings show that Chord-Following Reinforcement Learning paired with Stepwise Changes achieves the highest mean payoffs, while Harmony Prediction, though learning-based, can produce unstable loops and high variance; non-RL baselines perform poorly, and RL strategies generally improve over time. These results offer a quantitative lens on improvisational strategy and motivate AI-assisted analysis and training on jazz solos to further refine reward structures and strategy adaptation in musical games.
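As a rough illustration of how the payoff formula behaves, the sketch below evaluates $P=\frac{VM-H}{VM+H}$ for given values of $V$ and $H$. How $V$ and $H$ are measured from the notes played is not specified here, so the function takes them as plain numbers. The function name `payoff`, the assumption that both inputs are non-negative, and the value returned when the denominator is zero are illustrative choices, not part of the model as described.

```python
# Balancing factor quoted in the text (approximate value).
M = 1208.7571


def payoff(v: float, h: float, m: float = M) -> float:
    """Combine a variance term v and a harmony term h into P = (v*m - h) / (v*m + h).

    Assumes v >= 0 and h >= 0, in which case P lies in [-1, 1].
    """
    denom = v * m + h
    if denom == 0:
        # Assumption: the text does not define P when both terms vanish,
        # so this sketch returns 0.0 as a neutral value.
        return 0.0
    return (v * m - h) / denom


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the range of P.
    for v, h in [(1.0, M), (2.0, 0.0), (0.0, 3.0)]:
        print(f"V={v}, H={h} -> P={payoff(v, h):.4f}")
```

Under these assumptions, $P$ is 0 when $VM = H$, approaches 1 as the scaled variance term dominates, and approaches $-1$ as the harmony term dominates, which is what makes $M$ act as a balancing factor between the two components.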
Abstract
Live music owes much of its appeal to the unpredictability of improvisation, which arises from the dynamics among musicians and their interactions with the audience. Jazz improvisation is a particularly noteworthy example that merits investigation from a theoretical perspective. Here, we introduce a novel game-theoretic model of jazz improvisation, providing a framework for studying music theory and improvisational methodologies. We use computational modeling, mainly reinforcement learning, to explore diverse stochastic improvisational strategies and how they perform when paired. We find that the most effective pair combines a strategy that reacts to the most recent payoff (Stepwise Changes) with a reinforcement learning strategy limited to notes in the given chord (Chord-Following Reinforcement Learning). Conversely, the strategy pair based on reacting to the partner's last note and attempting to harmonize with it (Harmony Prediction) yields the lowest non-control payoff and the highest standard deviation, indicating that picking notes based on immediate reactions to the partner can yield inconsistent outcomes. On average, the Chord-Following Reinforcement Learning strategy demonstrates the highest mean payoff, while Harmony Prediction exhibits the lowest. Our work lays the foundation for promising applications beyond jazz, including the use of artificial intelligence (AI) models to extract data from audio clips to refine musical reward systems, and the training of machine learning (ML) models on existing jazz solos to further refine strategies within the game.
