The Ghost in the Keys: A Disklavier Demo for Human-AI Musical Co-Creativity

Louis Bradshaw, Alexander Spangher, Stella Biderman, Simon Colton

TL;DR

This work tackles the gap between asynchronous AI music generation and embodied live performance by introducing Aria-Duet, a Disklavier-based real-time piano duet with a state-of-the-art generative model. The system combines a finetuned autoregressive piano model with a low-latency real-time engine that supports a turn-taking interaction (Listen–Takeover–Generate) and bespoke Disklavier playback to maintain musical coherence and physical responsiveness. Key contributions include explicit pedal-token finetuning to improve sustain behavior, a continuous prefill KV-cache strategy to reduce takeover latency, and a zero-latency playback layer that respects Disklavier hardware constraints, plus a musicological demonstration showing semantic continuity and multi-voice dialogue. The results indicate plausible, coherent, and musically sophisticated co-creation, offering a practical blueprint for real-time embodied AI in live performance and suggesting directions for broader evaluation and adoption in human–AI musical collaboration.

Abstract

While generative models for music composition are increasingly capable, their adoption by musicians is hindered by text-prompting, an asynchronous workflow disconnected from the embodied, responsive nature of instrumental performance. To address this, we introduce Aria-Duet, an interactive system facilitating a real-time musical duet between a human pianist and Aria, a state-of-the-art generative model, using a Yamaha Disklavier as a shared physical interface. The framework enables a turn-taking collaboration: the user performs, signals a handover, and the model generates a coherent continuation performed acoustically on the piano. Beyond describing the technical architecture enabling this low-latency interaction, we analyze the system's output from a musicological perspective, finding that the model can maintain stylistic semantics and develop coherent phrasal ideas. This demonstrates that such embodied systems can engage in musically sophisticated dialogue and opens a promising new path for human-AI co-creation.

Paper Structure

This paper contains 13 sections and 2 figures.

Figures (2)

  • Figure 1: An illustration of the KV-Cache management for real-time, low-latency operation. (1) Listen: As the user plays, the received context is proactively and continuously prefilled into the model's KV-Cache in chunks. (2) Takeover: Upon a takeover signal, the system finalizes the input, prefilling any missing context and speculatively re-evaluating the durations of any hanging notes (seen in blue), ensuring a seamless transition and preparing the KV-Cache. (3) Generate: The model then begins generating a musical continuation note-by-note, streaming the result to the Disklavier.
  • Figure 2: A musician interacts with the Aria-Duet system using a Disklavier.
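
The Listen, Takeover, Generate flow described in the Figure 1 caption can be illustrated with a toy sketch. This is not the authors' implementation: the class name `ChunkedPrefillSession`, the `chunk_size` parameter, and the stub `next_token` callback are all hypothetical, and no real model or KV-cache is involved (the "cache" is just a counter of tokens already processed). The speculative re-evaluation of hanging note durations at takeover is not modeled. The sketch only shows why prefilling in chunks while listening leaves little work to do at the moment of takeover.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChunkedPrefillSession:
    """Toy stand-in for the Listen/Takeover/Generate loop (hypothetical names)."""
    chunk_size: int = 4                                   # assumed chunk length, in tokens
    context: List[int] = field(default_factory=list)      # every token seen so far
    cached: int = 0                                       # tokens already in the "KV-cache"
    prefill_calls: List[int] = field(default_factory=list)  # size of each prefill pass

    def _prefill(self, upto: int) -> None:
        # A real system would run a model forward pass here; we only record its size.
        if upto > self.cached:
            self.prefill_calls.append(upto - self.cached)
            self.cached = upto

    def listen(self, token: int) -> None:
        """Listen: buffer an incoming token; prefill once a full chunk is pending."""
        self.context.append(token)
        if len(self.context) - self.cached >= self.chunk_size:
            self._prefill(len(self.context))

    def takeover(self) -> int:
        """Takeover: prefill the remainder; return how many tokens were still pending."""
        pending = len(self.context) - self.cached
        self._prefill(len(self.context))
        return pending

    def generate(self, next_token: Callable[[List[int]], int], n: int) -> List[int]:
        """Generate: extend the context one token at a time, keeping the cache in sync."""
        out = []
        for _ in range(n):
            tok = next_token(self.context)
            self.context.append(tok)
            self.cached = len(self.context)  # each decoding step adds one cache entry
            out.append(tok)
        return out


# Usage: ten "played" tokens arrive, then the user signals a takeover.
session = ChunkedPrefillSession(chunk_size=4)
for tok in range(10):
    session.listen(tok)
pending = session.takeover()  # only the tail that did not fill a chunk remains
continuation = session.generate(lambda ctx: ctx[-1] + 1, n=3)  # stub "model"
```

In this toy run, eight of the ten input tokens are already cached before the takeover signal, so the takeover step only has to process the last two. The work left at takeover is bounded by the chunk size rather than by the length of the performance, which is the intuition behind the continuous prefill strategy.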