From "um" to "yeah": Producing, predicting, and regulating information flow in human conversation

Claire Augusta Bergey, Simon DeDeo

TL;DR

This work investigates how information flows in natural conversation by estimating surprisal-based information density from a large-scale naturalistic corpus (CANDOR) using long-context next-word predictions, yielding an average rate of $13.21$ bits/second and $4.04$ bits/word. It demonstrates that high-surprise words are produced more slowly due to both lexical-length and context-driven mechanisms, and that listeners’ backchannels regulate information flow by signaling upcoming shifts to novel material. The study also shows that retrieval and production impose distinct cognitive costs, with disfluencies acting as time-buying devices that predict increased upcoming surprisal. Overall, the findings support resource-limited models of conversational dynamics while highlighting the need for multiscale, non-Shannon-based frameworks to fully capture the variability and coordination in human dialogue.

Abstract

Conversation demands attention. Speakers must call words to mind, listeners must make sense of them, and both together must negotiate this flow of information, all in fractions of a second. We used large language models to study how this works in a large-scale dataset of English-language conversation, the CANDOR corpus. We provide a new estimate of the information density of unstructured conversation, approximately 13 bits/second, and find significant effects associated with the cognitive load of both retrieving and presenting that information. We also reveal a role for backchannels -- the brief yeahs, uh-huhs, and mhmms that listeners provide -- in regulating the production of novelty: the lead-up to a backchannel is associated with a declining information rate, while speech downstream rebounds to previous rates. Our results provide new insights into long-standing theories of how we respond to fluctuating demands on cognitive resources, and how we negotiate those demands in partnership with others.
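The quantities behind the bits/word and bits/second figures can be illustrated with a minimal sketch. It assumes the standard definition of surprisal, $-\log_2 p$ for a word with model probability $p$ given its context, and assumes that an information rate is total surprisal divided by total speaking time. The function names `surprisal_bits` and `information_rate` are hypothetical, and how the authors treat pauses when computing rates is not stated here, so this is an illustration rather than their pipeline.

```python
import math


def surprisal_bits(prob: float) -> float:
    """Surprisal in bits, -log2(p), of a word with contextual probability p."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log2(prob)


def information_rate(surprisals_bits, durations_ms):
    """Total bits divided by total seconds (assumed definition, pauses excluded)."""
    total_seconds = sum(durations_ms) / 1000.0
    if total_seconds <= 0:
        raise ValueError("total duration must be positive")
    return sum(surprisals_bits) / total_seconds


# A word the model assigns probability 1/4 carries 2 bits of surprise.
quarter = surprisal_bits(0.25)

# The word "like" from the Figure 1 caption: 2.7 bits spoken over 210 ms.
rate_like = information_rate([2.7], [210])
```

Under these assumptions, the single word "like" from Figure 1 comes out at a little under 13 bits/second, close to the corpus-wide average reported above, though individual words vary widely around that average.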

Paper Structure (8 sections, 4 figures)

Figures (4)

  • Figure 1: A sample fragment from a processed version of the CANDOR corpus, with LLM-estimated surprise in bits and AWS-estimated word lengths in milliseconds. Grey bars mark pauses greater than 10 ms. For example, the word "like" follows the word "drugs" after a pause of 310 ms; the word lasts 210 ms, and has a surprise of $2.7$ bits. In this example, the second speaker provides a backchannel ("yeah") that begins while the first speaker is finishing the word "and", and lasts 590 ms, continuing into the pause following "eventually". The LLM context window is 128 tokens.
  • Figure 2: The distribution of information rates in continuous conversational English, CANDOR corpus (659,496 words, 54,958 sequences). Solid line: information density of single words; dashed line: average density over the course of twelve consecutive words.
  • Figure 3: Words with higher surprise take longer to say. This holds not only in the aggregate (solid line), but even for the same word: when a word appears in a higher-surprise context, it tends to be said more slowly than it is in a low-surprise context. Dashed lines show three examples of this effect for common words: "you" (N=14,150), "if" (N=2,701), and "people" (N=2,534), binned by decile in timing.
  • Figure 4: The timeseries of surprise for continuous backchanneled speech (solid black line) relative to the base model (blue line), and relative to a sample from the base model that matches the surprise distribution at the backchannel (red line), with standard errors. The backchannel occurs at word position zero (e.g., in Fig. 1, the word "and"). All: all backchannels; Fluent: a backchannel that follows a word with a gap of less than 15 ms.