That's Optional: A Contemporary Exploration of "that" Omission in English Subordinate Clauses

Ella Rabinovich

TL;DR

This study tests the Uniform Information Density (UID) hypothesis on syntactic reduction by analyzing the optional "that" in English subordinate clauses (SCs) in a large Reddit-derived corpus. It extends UID with an entropy-based measure and uses pretrained LLMs to estimate surprisal and entropy at the SC onset, showing that information-theoretic constraints predict "that" omission beyond simple frequency effects. Both SC onset surprisal and SC onset entropy make significant, robust contributions across main-verb lemmas, supporting UID as a mechanism shaping syntactic choices in spontaneous written language. The findings bear on how contemporary language models reflect information-structuring principles in real-world language use, with implications for linguistic theory and NLP applications.

Abstract

The Uniform Information Density (UID) hypothesis posits that speakers optimize the communicative properties of their utterances by avoiding spikes in information, thereby maintaining a relatively uniform information profile over time. This paper investigates the impact of UID principles on syntactic reduction, focusing on the optional omission of the connector "that" in English subordinate clauses. Building on previous research, we extend the investigation to a larger corpus of written English, use contemporary large language models (LLMs), and extend the information-uniformity principles with the notion of entropy to estimate how UID manifests in syntactic reduction choices.
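
The two quantities named above, surprisal and entropy at the subordinate-clause onset, can be illustrated with their standard information-theoretic definitions. The sketch below is not the paper's estimation pipeline, which this section does not describe; it assumes a next-token probability distribution is already available (for example, from an LLM), and the distribution, token names, and function names are hypothetical placeholders with made-up numbers.

```python
import math


def surprisal(dist, token):
    """Surprisal (in bits) of `token` under a next-token distribution."""
    return -math.log2(dist[token])


def entropy(dist):
    """Shannon entropy (in bits) of a next-token distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical next-token distribution after a main verb such as "think".
# These probabilities are invented for illustration, not taken from the paper.
toy_dist = {"that": 0.25, "they": 0.5, "it": 0.125, "we": 0.125}

# Surprisal of the word that actually opens the subordinate clause.
onset_surprisal = surprisal(toy_dist, "they")

# Uncertainty over all possible continuations at the same position.
onset_entropy = entropy(toy_dist)

print(f"SC onset surprisal: {onset_surprisal:.2f} bits")
print(f"SC onset entropy:   {onset_entropy:.2f} bits")
```

Surprisal measures how unexpected the observed clause-initial word is, while entropy measures how uncertain the continuation is before any word is observed, which is why the two can contribute separately to a model of "that" omission.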

Paper Structure

This paper contains 23 sections, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Top-10 most frequent lemmas in the data. Bar height denotes each lemma's relative share of the total, and each bar is split by the relative usage of explicit and implicit "that" SCONJ. Sentences with the top-10 lemmas account for 64.7% of all sentences in the dataset.
  • Figure 2: Kernel density estimation plots: SC onset surprisal for explicit and implicit "that" usages, using the full lemma set (A) and the "think" lemma (B). SC onset entropy for explicit and implicit "that" usages, for the full lemma set (C) and "think" main verb lemma only (D).
  • Figure 3: Constituency parse tree of the sentence "He's smart enough to know that you are a good catch." Note the main verb "know" followed by the explicit SCONJ "that" and subordinate clause "you are a good catch".
  • Figure 4: Constituency parse tree of the sentence "yeah, that, and I think they got a lower rent price compared to the renewal downtown". Note the main verb "think" followed by the omitted SCONJ "that" and subordinate clause "they got a lower rent price compared to the renewal downtown".