Do Language Models Know Theo Has a Wife? Investigating the Proviso Problem

Tara Azin, Daniel Dumitrescu, Diana Inkpen, Raj Singh

TL;DR

This work provides the first computational evaluation framework for the proviso problem and highlights the need for diagnostic, multi-method approaches to assess pragmatic competence and context-dependent meaning in language models.

Abstract

We investigate how language models handle the proviso problem, an unresolved issue in pragmatics where presuppositions in conditional sentences diverge between theoretical and human interpretations. We reformulate this phenomenon as a Natural Language Inference task and introduce a diagnostic dataset designed to probe presupposition projection in conditionals. We evaluate RoBERTa, DeBERTa, LLaMA, and Gemma using explainability analyses. The results show that models broadly align with human judgments but rely on shallow pattern matching rather than semantic or pragmatic reasoning. Our work provides the first computational evaluation framework for the proviso problem and highlights the need for diagnostic, multi-method approaches to assess pragmatic competence and context-dependent meaning in language models.
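The NLI reformulation described in the abstract can be illustrated with a minimal sketch. It pairs a conditional premise with two candidate hypotheses, the unconditional presupposition and the conditional one, using the example sentence from Figure 1. The names `NLIExample` and `build_proviso_pairs` and the string templates are assumptions made for illustration, not the authors' dataset construction code, and gold labels are deliberately left out because they are not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NLIExample:
    """One premise/hypothesis pair (hypothetical container, not the paper's schema)."""
    premise: str
    hypothesis: str
    kind: str  # "unconditional" or "conditional" presupposition


def build_proviso_pairs(antecedent: str, consequent: str, presupposition: str) -> list[NLIExample]:
    """Build the two hypotheses that the proviso problem contrasts.

    The premise is the conditional sentence. The first hypothesis states the
    presupposition outright; the second makes it conditional on the antecedent.
    """
    premise = f"If {antecedent}, {consequent}."
    return [
        NLIExample(premise, f"{presupposition}.", "unconditional"),
        NLIExample(premise, f"If {antecedent}, {presupposition}.", "conditional"),
    ]


pairs = build_proviso_pairs(
    antecedent="Theo hates sonnets",
    consequent="so does his wife",
    presupposition="Theo has a wife",
)
for example in pairs:
    print(f"[{example.kind}] {example.premise} => {example.hypothesis}")
```

A model's three-way prediction for each pair could then be compared against a human-based or a theory-based label, which is the contrast the paper draws.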

Paper Structure (23 sections, 1 equation, 8 figures, 2 tables)

Figures (8)

  • Figure 1: The proviso problem illustrated. Given the sentence "If Theo hates sonnets, so does his wife", formal semantic theories predict a conditional presupposition, while human speakers typically accommodate an unconditional presupposition. Our work investigates where language models fall in this theory-human divide.
  • Figure 2: Integrated Gradients (IG) visualization showing token-level attribution. Darker shading indicates higher IG values, with the presupposition trigger receiving the strongest attribution.
  • Figure 3: Three-way classification prompt used for zero-shot evaluation of Gemma and LLaMA models.
  • Figure 4: Accuracy across sentence types in Subset 2. The horizontal axis shows accuracy percentages, and the values inside the bars indicate the corresponding trigger IG ratios.
  • Figure 5: Models’ accuracies in zero-shot evaluation across four subsets, using human and theory-based labels.
  • ...and 3 more figures
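
The Figure 4 caption refers to trigger IG ratios without defining them in this listing. One plausible reading, offered only as an assumption, is the share of total absolute attribution that falls on the presupposition-trigger tokens. The sketch below computes that quantity from precomputed per-token attributions; the function name `trigger_ig_ratio`, the use of absolute values, and the example numbers are all hypothetical and may differ from the paper's definition.

```python
from typing import Sequence


def trigger_ig_ratio(attributions: Sequence[float], trigger_mask: Sequence[bool]) -> float:
    """Fraction of absolute attribution mass on trigger tokens (assumed definition).

    attributions: one attribution score per token, e.g. from Integrated Gradients.
    trigger_mask: True for tokens that belong to the presupposition trigger.
    """
    if len(attributions) != len(trigger_mask):
        raise ValueError("attributions and trigger_mask must have the same length")
    total = sum(abs(a) for a in attributions)
    if total == 0:
        return 0.0  # no attribution mass at all, so no share to report
    on_trigger = sum(abs(a) for a, is_trigger in zip(attributions, trigger_mask) if is_trigger)
    return on_trigger / total


# Made-up attribution values for illustration only, not results from the paper.
tokens = ["If", "Theo", "hates", "sonnets", ",", "so", "does", "his", "wife"]
scores = [0.05, 0.10, 0.05, 0.10, 0.00, 0.05, 0.05, 0.30, 0.30]
mask = [token in {"his", "wife"} for token in tokens]  # assumed trigger span
print(round(trigger_ig_ratio(scores, mask), 2))
```

Under this reading, a ratio near 1 would mean the model's decision is attributed almost entirely to the trigger, while a ratio near 0 would mean the trigger contributes little.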