Statistical laws and linguistics inform meaning in naturalistic and fictional conversation

Ashley M. A. Fehr, Calla G. Beauregard, Julia Witte Zimmerman, Katie Ekström, Pablo Rosillo-Rodes, Christopher M. Danforth, Peter Sheridan Dodds

TL;DR

Problem: Do Heaps' and Zipf's laws apply to both naturalistic and scripted conversations, and how does medium shape meaning construction? Approach: analyze the CANDOR and Movie-Dialogs datasets with POS tagging, Heaps' and Zipf's scaling, interarrival times, burstiness, memory, and type-token measures to compare naturalistic and scripted dialogue. Findings: all corpora show sublinear Heaps' scaling with exponents $\beta$ of roughly 0.63–0.65; open-class nouns and verbs drive higher exponents; interjections and function words display medium-specific patterns; lexical diversity varies across corpora. Significance: the results suggest shared efficiency-driven constraints across modalities while highlighting how medium-specific affordances influence the roles of linguistic units, informing theories of meaning-making and dialogical communication.

Abstract

Conversation is a cornerstone of social connection and is linked to well-being outcomes. Conversations vary widely in type, with some portion generating complex, dynamic stories. One approach to studying how conversations unfold in time is through statistical patterns such as Heaps' law, which holds that vocabulary size scales as a power law of document length. Little work on Heaps' law has looked at conversation or considered how language features affect scaling. We measure Heaps' law for conversations recorded in two distinct mediums: (1) strangers brought together on video chat and (2) fictional characters in movies. We find that the scaling of vocabulary size differs by part of speech. We discuss these findings through behavioral and linguistic frameworks.
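
As an illustration of the scaling the abstract refers to, Heaps' law is commonly written as $V(N) \approx K N^{\beta}$, where $N$ is the number of tokens read so far, $V$ is the number of distinct word types seen, and $\beta < 1$ indicates sublinear growth. The sketch below is not the authors' pipeline: it assumes whitespace tokenization and an ordinary least-squares fit in log-log space over the whole curve, and the names `vocabulary_growth` and `fit_heaps` are hypothetical.

```python
import math


def vocabulary_growth(tokens):
    """Return (N, V) pairs: tokens read so far and distinct types seen so far."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve


def fit_heaps(curve):
    """Fit log V = log K + beta * log N by least squares; return (K, beta).

    Assumption: every point on the curve is used. A real analysis would
    likely restrict the fit to a chosen scaling regime.
    """
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


# Toy utterance (hypothetical data), split on whitespace.
toy_tokens = "the cat sat on the mat and the dog sat on the rug".split()
toy_k, toy_beta = fit_heaps(vocabulary_growth(toy_tokens))
print(f"K = {toy_k:.3f}, beta = {toy_beta:.3f}")
```

Because repeated words add tokens without adding types, the fitted exponent for any text with repetition falls below 1. A text in which every token is new would give exactly $\beta = 1$.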

Paper Structure

This paper contains 36 sections, 2 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: An analogy wherein information is to meaning as lemons are to lemonade. The highlighted word stem 'scoop' in the top and bottom conversation bubbles demonstrates that context changes the word's meaning: the contexts are ice cream (top) and someone publishing your idea before you, i.e., being 'scooped' (bottom).
  • Figure 2: Words by utterances heatmap per corpus with Pearson correlation coefficients. The Movies (individual) corpus is subsetted to include conversations having a minimum of 10 utterances.
  • Figure 3: Normalized part-of-speech proportions per corpus as total words (A) and unique words (B). Each corpus's total and unique proportions each sum to $1.0$. Any likeness to a country's flag is coincidental, a byproduct of our effort to use an accessible color scheme.
  • Figure 4: Zipf's law (A) and Heaps' law (B) for the corpora. The grey dashed line indicates a coefficient of $1$, and the solid black line indicates the scaling regime restriction on the (A) horizontal axis or (B) vertical axis. Standard error of the slope: no notation for SES $<.01$, ^ for $.05 > \text{SES} \geq .01$, and $\ast$ for SES $> .05$.
  • Figure 5: Part of speech linear regressions for the corpora. Below the grey dashed line ($\beta=1$) is sublinear scaling, and the solid black line indicates the scaling regime restriction on the vertical axis. Standard error of the slope: no notation for SES $<.01$, ^ for $.05 > \text{SES} \geq .01$, and $\ast$ for SES $> .05$.
  • ...and 9 more figures