Natural Language Decompositions of Implicit Content Enable Better Text Representations

Alexander Hoyle; Rupak Sarkar; Pranav Goel; Philip Resnik

TL;DR

This work introduces inferential decompositions: natural-language representations of the explicit and implicit propositions related to an utterance, generated by exemplar-guided prompting of large language models. The authors validate the plausibility of the generated propositions with human judgments and test the approach in two domains (clustering public opinion and modeling legislator co-voting), showing that incorporating implicit content can improve interpretability and some downstream tasks, particularly theme discovery and socially relevant analyses. The method improves semantic similarity on argument- and Twitter-based STS tasks and enables richer narrative discovery, but shows mixed results on standard STS benchmarks, which indicates that the benefits are task-dependent. Overall, treating implicit content as a first-class object of analysis enables more nuanced text-as-data analyses and holds promise for social science applications that require deeper interpretation of utterances.

Abstract

When people interpret text, they rely on inferences that go beyond the observed language itself. Inspired by this observation, we introduce a method for the analysis of text that takes implicitly communicated content explicitly into account. We use a large language model to produce sets of propositions that are inferentially related to the text that has been observed, then validate the plausibility of the generated content via human judgments. Incorporating these explicit representations of implicit content proves useful in multiple problem settings that involve the human interpretation of utterances: assessing the similarity of arguments, making sense of a body of opinion data, and modeling legislative behavior. Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP and particularly its applications to social science.

Paper Structure (33 sections, 1 equation, 11 figures, 12 tables)

Introduction
The Method and its Rationale
Generation Validity
Generation of decompositions.
Human annotation of plausibility.
Semantic Textual Similarity.
Inferential Decompositions Help Theme Discovery
Dataset.
Method.
Automated Evaluation.
Human Evaluation.
Convergent Validity.
Decompositions support analyses of legislator behavior
Model Setup.
Dataset.
...and 18 more sections

Figures (11)

Figure 1: Example showing a pair of tweets from legislators along with inferentially-related propositions. Embeddings over observed tweets have a high cosine distance, while embeddings over (different types of) propositions place them closer to each other (see the section on co-vote prediction).
Figure 2: A condensed version of our prompt to models.
Figure 3: Human-judged plausibility of inferentially-related propositions for different inference types. The vast majority of inferred propositions are plausible.
Figure 4: Human evaluation of clustering outputs. Clusters of decompositions (our method) take significantly less time to review and are more distinctive from one another. Relatedness scores are high for the observed comments, but significantly worse membership identification scores reveal this to be a spurious result owed to the topical homogeneity of the dataset (all comments are about COVID vaccines). All differences are significant at $p < 0.05$ except membership scores between comments and sentences and evaluation times for sentences and decompositions.
Figure 5: t-SNE (van der Maaten and Hinton, 2008) visualization of the embedding space of implicit inferred decompositions found in the "Abortion" topic from legislative tweets. $\bigstar$ and ✖ are two clusters selected from 10 clusters obtained using $K$-means; $\bigstar$ (59% Democrat) talks about the role of the judiciary in reproductive rights, while ✖ (73% Republican) talks about banning late-stage abortion. Our method leads to more compact clusters (better Silhouette, CH, and DB scores compared to tweets) that are easier to interpret and help with narrative discovery.
...and 6 more figures
