Echoes of Humanity: Exploring the Perceived Humanness of AI Music

Flavio Figueiredo, Giovanni Martinelli, Henrique Sousa, Pedro Rodrigues, Frederico Pedrosa, Lucas N. Ferreira

TL;DR

This work probes how humans perceive AI-generated music (AIM) versus human-composed music using a blind, Turing-style test implemented as a randomized controlled crossover trial (RCCT) and a mixed-methods content analysis. The study introduces an in-the-wild AIM dataset sourced from Reddit/Suno and pairs it with human songs from MTG-Jamendo, enabling causal assessment of discriminability across random and highly similar song pairs. Results show that listeners struggle to distinguish AIM from human music in random pairs but achieve significant discrimination when pairs are highly similar, with accuracy rising from around $0.53$ to $0.66$; longer musical experience and prior AIM knowledge further boost performance. Qualitative feedback reveals that vocal quality, lyrics, and production cues dominate judgments, informing both model development to sound more human-like and educational efforts to help users detect AIM. The work provides a causal, data-rich view on humanness perception and releases data and code to support reproducibility and further study.

Abstract

Recent advances in AI music (AIM) generation services are currently transforming the music industry. Given these advances, understanding how humans perceive AIM is crucial both to educate users on identifying AIM songs, and, conversely, to improve current models. We present results from a listener-focused experiment aimed at understanding how humans perceive AIM. In a blind, Turing-like test, participants were asked to distinguish, from a pair, the AIM and human-made song. We contrast with other studies by utilizing a randomized controlled crossover trial that controls for pairwise similarity and allows for a causal interpretation. We are also the first study to employ a novel, author-uncontrolled dataset of AIM songs from real-world usage of commercial models (i.e., Suno). We establish that listeners' reliability in distinguishing AIM causally increases when pairs are similar. Lastly, we conduct a mixed-methods content analysis of listeners' free-form feedback, revealing a focus on vocal and technical cues in their judgments.
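As a rough illustration of what "distinguishing better than chance" means in a two-choice test like this one, the sketch below runs a one-sided exact binomial test against a guessing rate of 0.5. It is not the authors' analysis, which relies on a randomized controlled crossover trial with covariates. The trial count of 200 per condition and the function name are hypothetical and chosen only for illustration; the correct-answer counts are picked to match the approximate accuracies of 0.53 (random pairs) and 0.66 (similar pairs) mentioned in the TL;DR.

```python
from math import comb


def binomial_p_value_above_chance(correct: int, trials: int, chance: float = 0.5) -> float:
    """One-sided exact binomial test: P(X >= correct) if every answer were a guess."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical sample size, NOT taken from the paper.
HYPOTHETICAL_TRIALS = 200

# Counts chosen so that correct / trials matches the reported accuracies.
correct_random = 106   # 106 / 200 = 0.53
correct_similar = 132  # 132 / 200 = 0.66

p_random = binomial_p_value_above_chance(correct_random, HYPOTHETICAL_TRIALS)
p_similar = binomial_p_value_above_chance(correct_similar, HYPOTHETICAL_TRIALS)

print(f"random pairs:  accuracy 0.53, p = {p_random:.4f}")
print(f"similar pairs: accuracy 0.66, p = {p_similar:.2e}")
```

Under this assumed sample size, an accuracy of 0.53 is not distinguishable from guessing at the usual 0.05 level, while 0.66 is. The actual conclusion depends on the real number of responses and on the study's model, so the p-values here should not be read as results from the paper.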

Paper Structure

This paper contains 13 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Demographic and Answer Variables
  • Figure 2: Topics and tags. Word size is proportional to usage within topic. Top-7 overall frequency: vocals (369), lyrics (247), negative (231), artificial (224), generic (174), human (130), robotic (112)
  • Figure 3: Observed Topic Frequencies and Differences Towards the Expected. ***$p < .01$
  • Figure 4: Alternative Models. Observe that we progressively remove covariates. Regardless of the model, the exposure to similar pairs is always significant.