The Ghost in the Keys: A Disklavier Demo for Human-AI Musical Co-Creativity
Louis Bradshaw, Alexander Spangher, Stella Biderman, Simon Colton
TL;DR
This work addresses the gap between asynchronous AI music generation and embodied live performance by introducing Aria-Duet, a Disklavier-based system for real-time piano duets with a state-of-the-art generative model. The system combines a finetuned autoregressive piano model with a low-latency real-time engine that supports turn-taking interaction (Listen–Takeover–Generate), and uses bespoke Disklavier playback to maintain musical coherence and physical responsiveness. Key technical contributions are explicit pedal-token finetuning to improve sustain behavior, a continuous-prefill KV-cache strategy to reduce takeover latency, and a zero-latency playback layer that respects Disklavier hardware constraints. A musicological demonstration shows semantic continuity and multi-voice dialogue in the model's output. The results indicate plausible, coherent, and musically sophisticated co-creation, offer a practical blueprint for real-time embodied AI in live performance, and suggest directions for broader evaluation and adoption in human-AI musical collaboration.
Abstract
While generative models for music composition are increasingly capable, their adoption by musicians is hindered by text-prompting, an asynchronous workflow disconnected from the embodied, responsive nature of instrumental performance. To address this, we introduce Aria-Duet, an interactive system facilitating a real-time musical duet between a human pianist and Aria, a state-of-the-art generative model, using a Yamaha Disklavier as a shared physical interface. The framework enables a turn-taking collaboration: the user performs, signals a handover, and the model generates a coherent continuation performed acoustically on the piano. Beyond describing the technical architecture enabling this low-latency interaction, we analyze the system's output from a musicological perspective, finding that the model can maintain stylistic semantics and develop coherent phrasal ideas. These findings demonstrate that such embodied systems can engage in musically sophisticated dialogue and open a promising new path for human-AI co-creation.
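As an illustration only, the turn-taking loop described above (the user performs, signals a handover, and the model continues) can be sketched as a small state machine. This is a minimal sketch under stated assumptions, not the Aria-Duet implementation: the class `DuetSession`, the callbacks `on_human_note` and `on_handover`, and the stand-in `echo_model` are hypothetical names, notes are reduced to bare MIDI pitch numbers, and the real system's timing, pedal tokens, and KV-cache prefill are omitted.

```python
from enum import Enum, auto
from typing import Callable, List


class Phase(Enum):
    LISTEN = auto()    # human is playing; notes accumulate as context
    TAKEOVER = auto()  # handover signalled; context is frozen as the prompt
    GENERATE = auto()  # model produces the continuation


# Hypothetical generator signature: (context pitches, note count) -> new pitches.
Generator = Callable[[List[int], int], List[int]]


def echo_model(context: List[int], n: int) -> List[int]:
    """Stand-in for the generative model (an assumption for this sketch):
    repeats the last n pitches one octave higher."""
    tail = context[-n:] if n > 0 else []
    return [pitch + 12 for pitch in tail]


class DuetSession:
    def __init__(self, generate: Generator) -> None:
        self.generate = generate
        self.phase = Phase.LISTEN
        self.context: List[int] = []   # shared history of human and model notes
        self.history = [Phase.LISTEN]  # record of visited phases, for inspection

    def _enter(self, phase: Phase) -> None:
        self.phase = phase
        self.history.append(phase)

    def on_human_note(self, pitch: int) -> None:
        # Assumption: human input is only recorded while listening.
        if self.phase is Phase.LISTEN:
            self.context.append(pitch)

    def on_handover(self, n_notes: int = 2) -> List[int]:
        """Run one Listen -> Takeover -> Generate -> Listen cycle."""
        if self.phase is not Phase.LISTEN:
            return []
        self._enter(Phase.TAKEOVER)
        prompt = list(self.context)  # snapshot of everything played so far
        self._enter(Phase.GENERATE)
        continuation = self.generate(prompt, n_notes)
        self.context.extend(continuation)  # model output joins the shared history
        self._enter(Phase.LISTEN)
        return continuation


session = DuetSession(echo_model)
for pitch in (60, 62, 64):  # C4, D4, E4
    session.on_human_note(pitch)
print(session.on_handover(n_notes=2))
```

Keeping the model's output in the same context as the human's notes means the next handover conditions on the whole exchange, which is one simple way to preserve continuity across turns.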
