Experimental Pragmatics with Machines: Testing LLM Predictions for the Inferences of Plain and Embedded Disjunctions

Polina Tsvilodub, Paul Marty, Sonia Ramotowska, Jacopo Romoli, Michael Franke

TL;DR

The paper investigates whether state-of-the-art large language models (LLMs) predict the fine-grained inferences associated with plain and embedded disjunctions, namely free choice (FC), ignorance (II), and distributive (DI) inferences, and whether those predictions match human data and bear out any of the competing theories (TIA, RIA, NIA). The authors recreate the human mystery-box experiments as prompts, run them on a range of transformer-based models, retrieve the log-probabilities of the response options, and quantify the fit to human acceptance rates using $R^2$ with bootstrap confidence intervals. The top LLMs reproduce human-like distinctions between disjunction inferences and scalar implicatures in several conditions, and larger models often fit better, but results vary substantially across inferences and contexts, especially for negation and for modal and nominal scopes. The study highlights both the potential and the limits of using LLMs to test linguistic theories, and proposes criteria for robust interpretation: cross-task consistency, full-distribution analysis, and capacity-aware benchmarking.

Abstract

Human communication is based on a variety of inferences that we draw from sentences, often going beyond what is literally said. While there is wide agreement on the basic distinction between entailment, implicature, and presupposition, the status of many inferences remains controversial. In this paper, we focus on three inferences of plain and embedded disjunctions, and compare them with regular scalar implicatures. We investigate this comparison from the novel perspective of the predictions of state-of-the-art large language models, using the same experimental paradigms as recent studies investigating the same inferences with humans. The results of our best performing models mostly align with those of humans, both in the large differences we find between those inferences and implicatures, as well as in fine-grained distinctions among different aspects of those inferences.

Paper Structure (9 sections, 1 equation, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Example prompt for the LLM experiments. Some parts of the prompt are omitted for brevity (in gray). The boldface trigger sentence is an example from Degano et al. (2023). The underlined sentence is the mystery box rule. Expressions in curly braces (in gray) vary by study. The character name is sampled at random on each trial. The likelihood of the last word (one of "good" / "bad", italicized) is retrieved to score the trigger, given the context.
  • Figure 2: Mean acceptance rate in the target conditions of each study by test case and source (LLMs or humans).
  • Figure 3: Human acceptance rate averaged over all items in each condition, by trigger type, plotted against model predictions (points). Lines indicate best linear model fit regressing human data against model predictions.
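
The scoring and fitting steps mentioned in the captions of Figures 1 and 3 can be sketched roughly as follows. This is a minimal illustration and not the authors' code: it assumes that a model's acceptance score is the probability of "good" renormalized over the two options "good" and "bad", that the fit is the $R^2$ of a simple linear regression of human acceptance rates on model scores, and that the confidence interval is a percentile bootstrap over conditions. The function names `acceptance_prob`, `r_squared`, and `bootstrap_ci` are hypothetical.

```python
import math
import random


def acceptance_prob(logp_good, logp_bad):
    """Renormalize two answer log-probabilities into P("good").

    Assumption: only "good" and "bad" compete; the paper may score differently.
    """
    m = max(logp_good, logp_bad)  # subtract the max for numerical stability
    e_good = math.exp(logp_good - m)
    e_bad = math.exp(logp_bad - m)
    return e_good / (e_good + e_bad)


def r_squared(model, human):
    """R^2 of the least-squares line regressing human rates on model scores.

    For a single predictor this equals the squared Pearson correlation.
    """
    n = len(model)
    mx = sum(model) / n
    my = sum(human) / n
    sxx = sum((a - mx) ** 2 for a in model)
    syy = sum((b - my) ** 2 for b in human)
    sxy = sum((a - mx) * (b - my) for a, b in zip(model, human))
    if sxx == 0 or syy == 0:
        return 0.0  # degenerate sample: no variance to explain
    return sxy ** 2 / (sxx * syy)


def bootstrap_ci(model, human, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for R^2, resampling conditions in pairs."""
    rng = random.Random(seed)
    n = len(model)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(r_squared([model[i] for i in idx],
                               [human[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


if __name__ == "__main__":
    # Made-up log-probabilities and human rates, for illustration only.
    logps = [(-0.4, -1.6), (-1.2, -0.9), (-0.3, -2.5), (-2.0, -0.5)]
    model_scores = [acceptance_prob(g, b) for g, b in logps]
    human_rates = [0.81, 0.40, 0.93, 0.15]
    print(r_squared(model_scores, human_rates))
    print(bootstrap_ci(model_scores, human_rates))
```

Renormalizing over the two options keeps the score comparable across models whose vocabularies spread probability mass differently; resampling whole (model, human) pairs preserves the pairing that the regression depends on.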