Statistical laws and linguistics inform meaning in naturalistic and fictional conversation
Ashley M. A. Fehr, Calla G. Beauregard, Julia Witte Zimmerman, Katie Ekström, Pablo Rosillo-Rodes, Christopher M. Danforth, Peter Sheridan Dodds
TL;DR
Problem: Do Heaps' and Zipf's laws hold for both naturalistic and scripted conversation, and how does the medium shape meaning construction? Approach: analyze the CANDOR and Movie-Dialogs datasets with POS tagging, Heaps' and Zipf's scaling, interarrival times, burstiness, memory, and type-token measures to compare naturalistic and scripted dialogue. Findings: all corpora show sublinear Heaps' scaling with exponents (β) of roughly 0.63–0.65; open-class nouns and verbs drive higher exponents; interjections and function words display medium-specific patterns; lexical diversity varies across corpora. Significance: the results suggest shared efficiency-driven constraints across modalities while highlighting how medium-specific affordances shape the roles of linguistic units, informing theories of meaning-making and dialogical communication.
Abstract
Conversation is a cornerstone of social connection and is linked to well-being outcomes. Conversations vary widely in type, with some portion generating complex, dynamic stories. One approach to studying how conversations unfold in time is through statistical patterns such as Heaps' law, which holds that vocabulary size grows sublinearly, as a power law, with document length. Little work on Heaps' law has looked at conversation or considered how language features affect scaling. We measure Heaps' law for conversations recorded in two distinct mediums: (1) strangers brought together on video chat and (2) fictional characters in movies. We find that the scaling of vocabulary size differs by part of speech. We discuss these findings through behavioral and linguistic frameworks.
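
As an illustration of the quantity being measured, Heaps' law is commonly written as V(n) ∝ n^β, where V(n) is the number of distinct word types seen after n tokens and β < 1 indicates sublinear growth. The sketch below is one simple way to estimate β, by an ordinary least-squares fit in log-log space. It is not necessarily the fitting procedure used in this work: the function names, the whitespace tokenization, the lowercasing, and the toy sentence are all assumptions made for illustration.

```python
import math

def heaps_curve(tokens):
    """Return (n, V(n)) pairs: distinct types seen after each of the first n tokens."""
    seen = set()
    curve = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok.lower())  # assumption: case-insensitive word types
        curve.append((n, len(seen)))
    return curve

def fit_heaps_exponent(curve):
    """Estimate beta as the least-squares slope of log V(n) against log n."""
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Toy utterance (hypothetical data, far too short for a meaningful estimate).
tokens = "well i think the movie was good but i think the book was better".split()
curve = heaps_curve(tokens)
beta = fit_heaps_exponent(curve)
print(f"tokens={curve[-1][0]} types={curve[-1][1]} beta={beta:.2f}")
```

Applied to a token stream restricted to a single part of speech (assuming POS-tagged input), the same two functions would give a per-category exponent of the kind compared in this work.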
