Can You Learn Semantics Through Next-Word Prediction? The Case of Entailment

William Merrill; Zhaofeng Wu; Norihito Naka; Yoon Kim; Tal Linzen

TL;DR

The study investigates whether the next-token probabilities of language models encode sentence-level entailment, as distributional semantics would predict. The authors formalize an entailment score computed from log-probabilities and evaluate it across diverse benchmarks and LM families, showing that the test detects entailment above chance, though in the opposite direction to the one predicted by the theoretical Gricean model. Corpus analysis and adapted pragmatic theories indicate that natural human language is more redundant than the Gricean assumption allows, suggesting that explanation-based and noise-tolerance accounts may be needed. The findings challenge a purely redundancy-minimizing view of speakers and highlight the potential of large corpora and LM probabilities for testing theories of pragmatics and semantics, with future work needed to account for redundancy in computational models of language. Distributional signals can thus inform semantic inference, but the underlying speaker model must be revised to match human language use, which affects both how entailment is evaluated and how LM probabilities are interpreted.

Abstract

Do LMs infer the semantics of text from co-occurrence patterns in their training data? Merrill et al. (2022) argue that, in theory, sentence co-occurrence probabilities predicted by an optimal LM should reflect the entailment relationship of the constituent sentences, but it is unclear whether probabilities predicted by neural LMs encode entailment in this way because of strong assumptions made by Merrill et al. (namely, that humans always avoid redundancy). In this work, we investigate whether their theory can be used to decode entailment relations from neural LMs. We find that a test similar to theirs can decode entailment relations between natural sentences, well above random chance, though not perfectly, across many datasets and LMs. This suggests LMs implicitly model aspects of semantics to predict semantic effects on sentence co-occurrence patterns. However, we find the test that predicts entailment in practice works in the opposite direction to the theoretical test. We thus revisit the assumptions underlying the original test, finding its derivation did not adequately account for redundancy in human-written text. We argue that better accounting for redundancy related to explanations might derive the observed flipped test and, more generally, improve computational models of speakers in linguistics.

Paper Structure (44 sections, 3 theorems, 19 equations, 19 figures, 2 tables)

Introduction
Distributional Semantics and the Entailment Test
The Entailment Test
Gricean Speakers.
Entailment Test.
Evaluating the Entailment Test
Entailment Datasets
Models
Evaluation Metric: Flipped ROC-AUC
Entailment Test Results
Flipped Test on Broad-Coverage Data
Varied Pattern for Targeted Phenomena
Learning a Distributional Entailment Test
Setup.
Results.
...and 29 more sections

Key Result

Proposition 1

Let $p$ be a Gricean speaker. Then, for any $x, y$, $\hat{E}_p(x, y) = E(x, y)$.

Figures (19)

Figure 1: Entailment score $\hat{E}_p(x, y)$ distribution computed with Llama2-70b probabilities on RTE. The score discriminates the two classes, though imperfectly.
Figure 2: Flipped AUC-ROC scores for the entailment test across datasets using Llama2-70b probabilities. The flipped test generally performs above random (=50) and the length baseline, while the original test works better for connectives ($<$50 Flipped ROC-AUC).
Figure 3: C4 validation bits per byte vs. flipped AUC-ROC score for all models on broad-coverage and targeted datasets. Note that the scale of the $y$-axis differs for each subplot. See the AUC-ROC histogram figure for a scale-controlled version of the Llama2-70b results. For broad-coverage datasets, model quality (represented by bits per byte, lower is better) clearly correlates with flipped test performance, though this is more complicated for the targeted test sets.
Figure 4: Flipped ROC-AUC of entailment score across Pythia-12b checkpoints. Each step is around 2M tokens.
Figure 5: Learned logistic regression coefficients for the log-prob features for the broad-coverage datasets. Each bar represents one LM. For ease of visualization, $y$-axis is in log scale, except in $[-0.1, 0.1]$ where it is linear.
...and 14 more figures
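
The captions above report a "flipped" ROC-AUC on a 0 to 100 scale, where 50 is chance and values below 50 mean the original (unflipped) test direction works better. A minimal sketch of one way such a metric could be computed is given here, assuming that flipping amounts to negating the entailment score before computing a standard ROC-AUC. The function names `roc_auc` and `flipped_roc_auc`, the label convention (1 = entailment, 0 = non-entailment), and the toy scores are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC-AUC as the probability that a random positive example
    receives a higher score than a random negative one (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def flipped_roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Assumed definition: ROC-AUC of the negated score, on a 0-100 scale.
    Above 50: lower scores indicate entailment (flipped direction).
    Below 50: higher scores indicate entailment (original direction)."""
    return 100.0 * roc_auc([-s for s in scores], labels)


# Hypothetical toy scores: entailed pairs (label 1) mostly score lower.
toy_scores = [-1.0, 0.5, 0.0, 1.0]
toy_labels = [1, 1, 0, 0]
result = flipped_roc_auc(toy_scores, toy_labels)
```

Under this assumed definition the flipped and unflipped values sum to 100, so a single number indicates which direction of the test discriminates entailment on a given dataset.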

Theorems & Definitions (6)

Proposition 1 (Merrill et al., 2022)
proof
Proposition 2
proof
Proposition 3
proof
