If Turing played piano with an artificial partner

Dobromir Dotov, Dante Camarena, Zack Harris, Joanna Spyra, Pietro Gagliano, Laurel Trainor

TL;DR

The paper investigates whether a generative neural network trained for score generation, rather than for interaction, can produce a convincing social experience during joint piano playing with humans. Using MusicVAE, the researchers implemented a timed turn-taking task and compared human-human duets with human-AI duets across multiple AI configurations. AI partners were generally rated below human partners, but certain 2-bar configurations with high imitation approached human levels on realism and ease of interaction, and their self-other integration ratings sometimes aligned with those for human partners. The study highlights the importance of interaction and timing in social AI for music, and suggests future work on architectures capable of dynamic synchronization and embodied coordination to achieve richer social experiences.

Abstract

Music is an inherently social activity that allows people to share experiences and feel connected with one another. There has been little progress in designing artificial partners that offer a social experience similar to playing with another person. Neural network architectures that implement generative models, such as large language models, are suited for producing musical scores. Playing music socially, however, involves more than playing a score; a player must complement the other musicians' ideas and keep time correctly. We addressed the question of whether a convincing social experience is made possible by a generative model trained to produce musical scores, not necessarily optimized for synchronization and continuation. The network, a variational autoencoder trained on a large corpus of digital scores, was adapted for a timed call-and-response task with a human partner. Participants played piano with a human or artificial partner, in various configurations, and rated the performance quality and first-person experience of self-other integration. Overall, the artificial partners held promise but were rated lower than human partners. The artificial partner with the simplest design and highest similarity parameter was not rated differently from the human partners on some measures, suggesting that interactive rather than generative sophistication is important in enabling social AI.

Paper Structure (14 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: Visual representation of the tasks where (a) the human-human duo take turns playing their individual pianos, as indicated on the screen by a diminishing coloured bar labelled Player A and Player B; and (b) human-AI turns are similarly indicated through a diminishing coloured bar, and coloured boxes representing the notes played are scrolling upwards on the screen.
  • Figure 2: Schematic of the architecture used to train generative models of piano note sequences. In training, the variational autoencoder learns to reproduce input sequences from a curated dataset by encoding them, passing them through a lower-dimensional latent space that takes the form of a multivariate probability distribution, and then decoding them. Later, the trained generative model can be used to mimic or reconstruct input sequences, as well as to sample and interpolate between learned sequences.
  • Figure 3: Ratings of performance quality. H: human-human performance; 2B: generative model with a two-bar time span; 4B: generative model with a four-bar time span; -T: low temperature; +T: high temperature; -S: low similarity; +S: high similarity.
  • Figure 4: Ratings of experience. H: human-human performance; 2B: generative model with a two-bar time span; 4B: generative model with a four-bar time span; -T: low temperature; +T: high temperature; -S: low similarity; +S: high similarity.
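
The latent-space operations named in the Figure 2 caption (sampling and interpolation) and the temperature and similarity labels in Figures 3 and 4 can be illustrated with a toy sketch. This is not the MusicVAE API and not necessarily how the authors implemented these parameters. All function names here are hypothetical. The use of linear blending for "similarity" is an assumption made for illustration. The latent vectors and decoder scores are made-up numbers.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterised draw z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def blend(z_human, z_random, similarity):
    """Linear interpolation in latent space (an assumed stand-in for 'similarity').

    similarity = 1.0 returns the human's latent code (pure imitation);
    similarity = 0.0 returns the unrelated sample.
    """
    return [similarity * h + (1.0 - similarity) * r
            for h, r in zip(z_human, z_random)]


def softmax_with_temperature(logits, temperature):
    """Low temperature sharpens the distribution; high temperature flattens it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)

# Hypothetical 2-D latent code for the human's phrase, plus an unrelated sample.
z_human = sample_latent(mu=[0.5, -1.0], log_var=[-2.0, -2.0], rng=rng)
z_random = [rng.gauss(0.0, 1.0) for _ in z_human]

z_imitative = blend(z_human, z_random, similarity=0.9)  # stays near the human phrase
z_divergent = blend(z_human, z_random, similarity=0.1)  # drifts toward the random sample

logits = [2.0, 1.0, 0.1]  # made-up decoder scores for three candidate notes
cool = softmax_with_temperature(logits, temperature=0.5)  # more predictable choice
warm = softmax_with_temperature(logits, temperature=2.0)  # more varied choice
```

In this sketch a high similarity keeps the response close to what the human just played, while temperature controls how predictable the decoded notes are. A real system would decode the blended latent vector into a note sequence with a trained decoder, which is omitted here.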