A Rational Analysis of the Speech-to-Song Illusion

Raja Marjieh; Pol van Rijn; Ilia Sucholutsky; Harin Lee; Thomas L. Griffiths; Nori Jacoby

TL;DR

The paper tackles why repetition makes spoken phrases sound more song-like and why this transformation varies by phrase. It formulates a principled Bayesian framework that treats the illusion as a two-hypothesis inference problem, comparing $p(s^n|\text{song})$ and $p(s^n|\text{speech})$ using corpus-derived statistics and, crucially, tests the approach with text-only data. It demonstrates that a purely textual prose-to-lyrics illusion emerges when repetition shifts the log-odds in favor of song, a prediction supported by both human judgments and GPT-4 outputs. Together, these results provide a parsimonious, cross-modal computational account of perceptual illusions rooted in learned linguistic statistics and suggest avenues for richer generative modeling across languages and modalities.
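To make the two-hypothesis comparison concrete, here is a minimal sketch of how repetition could shift the log-odds toward song. It is not the paper's model: the geometric repetition model, the function name `log_odds_song`, and the parameters `r_song` and `r_speech` are assumptions introduced only for illustration. The sketch assumes each hypothesis assigns its own probability to a phrase being repeated once more, with repetition more likely under song.

```python
import math

def log_odds_song(base_log_odds, n_repeats, r_song=0.5, r_speech=0.1):
    """Log-odds of song vs. speech after n_repeats extra repetitions.

    base_log_odds: log p(s|song) - log p(s|speech) for the phrase itself
                   (a hypothetical, phrase-specific value).
    r_song, r_speech: assumed probability of one more repetition under
                      each hypothesis (illustrative values, not estimates).
    """
    # Assumed geometric model: P(n extra repeats | h) = (1 - r_h) * r_h**n
    ll_song = math.log(1 - r_song) + n_repeats * math.log(r_song)
    ll_speech = math.log(1 - r_speech) + n_repeats * math.log(r_speech)
    return base_log_odds + ll_song - ll_speech

# A phrase that starts out speech-like (negative log-odds) drifts toward song.
for n in range(5):
    print(n, round(log_odds_song(-2.0, n), 3))
```

Under these assumptions each repetition adds a constant `log(r_song / r_speech)` to the log-odds, so a phrase with a strongly negative starting value needs more repetitions to cross zero than one that starts near it. That is one simple way to see why the transformation could vary by phrase.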

Abstract

The speech-to-song illusion is a robust psychological phenomenon whereby a spoken sentence sounds increasingly musical as it is repeated. Despite decades of research, a complete formal account of this transformation is still lacking, and some of its nuanced characteristics, namely that certain phrases appear to transform while others do not, are not well understood. Here we provide a formal account of this phenomenon by recasting it as a statistical inference in which a rational agent attempts to decide whether a sequence of utterances is more likely to have been produced in song or in speech. Using this approach and analyzing song and speech corpora, we further introduce a novel prose-to-lyrics illusion that is purely text-based. In this illusion, simply duplicating written sentences makes them appear more like song lyrics. We provide robust evidence for this new illusion in both human participants and large language models.

Paper Structure (15 sections, 3 equations, 5 figures)

Introduction
Background
Speech-to-Song Illusion
Rational Analysis
A Generative Account of Speech-to-Song
Modeling Methods
Text Preprocessing and Analysis
Text Corpora for Speech and Song
Modeling Results
Behavioral Methods
Behavioral Paradigm
Stimuli
Behavioral Results
Discussion
Acknowledgments

Figures (5)

Figure 1: The speech-to-song paradigm and the newly proposed prose-to-lyrics paradigm.
Figure 2: Aggregate repetition analysis. A. Log-likelihood curves for the naturalistic dataset as a function of the number of repetitions (aggregated across all words). B. Log-odds ratio for the naturalistic dataset (positive values favor song). C. and D. Similar to A and B for the synthetic (LLM-generated) dataset. Dashed red line indicates equal-likelihood threshold. Shaded area corresponds to 95% confidence intervals bootstrapped over documents.
Figure 3: Log-odds profiles for common sentences constructed from the naturalistic speech and song corpora under a naïve Bayes approximation. A. Absolute log-odds profiles, dashed red line indicates equal-likelihood threshold, positive values favor song. B. Relative log-odds profiles, with the value at zero subtracted (indicated by the dashed red line).
Figure 4: Behavioral results of the prose-to-lyrics experiments. A. Mean human ratings as a function of repetitions (rescaled to a 0-1 range via division by a factor of 6; dashed red line indicates speech-song equality threshold). Inset shows the data grouped based on a median split of the GPT-4 scores at 4 repetitions. Shaded area indicates 95% confidence intervals (CIs) bootstrapped over participants. B. Mean rating as a function of repetition for the GPT-4 control experiment. C. Mean sentence-level human ratings. D. Mean sentence-level human ratings relative to the value at 0 repetitions.
Figure 5: Prose-to-lyrics illusion for demonstrative sentences. Lines represent averages, shaded area represents 95% CIs.
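
Several captions above report 95% confidence intervals bootstrapped over documents or participants. A minimal sketch of a percentile bootstrap for a mean is given below. The function name `bootstrap_ci`, the resample count, and the toy values are assumptions for illustration; the paper's exact procedure is not described in this summary.

```python
import random

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Each entry of `values` stands for one resampling unit (for example,
    one document's log-odds or one participant's mean rating).
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample units with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-document log-odds values (not data from the paper).
toy_values = [0.2, -0.1, 0.5, 0.3, 0.1, 0.4]
print(bootstrap_ci(toy_values))
```

Resampling whole documents or participants, rather than individual words or ratings, keeps the dependence within each unit intact, which is presumably why the captions specify the resampling unit.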
