That's Optional: A Contemporary Exploration of "that" Omission in English Subordinate Clauses
Ella Rabinovich
TL;DR
This study tests the Uniform Information Density (UID) hypothesis on syntactic reduction by analyzing the optional "that" in English subordinate clauses (SCs) using a large Reddit-derived corpus. It introduces entropy-based extensions of UID and uses pretrained LLMs to estimate surprisal and entropy at the SC onset, showing that information-theoretic constraints predict "that" omission beyond simple frequency effects. Both SC onset surprisal and SC onset entropy make significant, robust contributions across main-verb lemmas, supporting UID as a mechanism shaping syntactic choices in spontaneous written language. The findings advance our understanding of how contemporary language models reflect information-structuring principles in real-world language use, with implications for linguistic theory and NLP applications.
Abstract
The Uniform Information Density (UID) hypothesis posits that speakers optimize the communicative properties of their utterances by avoiding spikes in information, thereby maintaining a relatively uniform information profile over time. This paper investigates the impact of UID principles on syntactic reduction, focusing on the optional omission of the connector "that" in English subordinate clauses. Building on previous research, we extend the investigation to a larger corpus of written English, use contemporary large language models (LLMs), and extend the information-uniformity principles with the notion of entropy to estimate how UID manifests in syntactic reduction choices.
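To make the two information-theoretic quantities concrete, the following is a minimal sketch of how surprisal and entropy could be computed at a subordinate-clause onset. It assumes a toy, hand-written next-word distribution; in the paper these probabilities are estimated with pretrained LLMs, and the names `surprisal`, `entropy`, and `onset_dist` are illustrative only, not the authors' implementation.

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal, in bits, of an outcome with probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log2(prob)


def entropy(dist: dict[str, float]) -> float:
    """Shannon entropy, in bits, of a next-word distribution."""
    if not math.isclose(sum(dist.values()), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Zero-probability outcomes contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical distribution over the first word of the subordinate clause,
# given the preceding context (e.g. "I think ..."). The values are made up
# for illustration and do not come from the paper.
onset_dist = {"it": 0.5, "we": 0.25, "the": 0.125, "this": 0.125}

# Surprisal of the word that actually opens the clause (here, "we").
onset_surprisal = surprisal(onset_dist["we"])

# Entropy of the whole distribution: uncertainty about the onset
# before the word is observed.
onset_entropy = entropy(onset_dist)

print(f"SC onset surprisal: {onset_surprisal:.2f} bits")
print(f"SC onset entropy:   {onset_entropy:.2f} bits")
```

Read against the UID hypothesis as stated above, a high-surprisal onset is the kind of information spike that an explicit "that" would be expected to smooth, while entropy captures uncertainty over all possible onsets rather than the one that was realized.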
