Scope Ambiguities in Large Language Models

Gaurav Kamath, Sebastian Schuster, Sowmya Vajjala, Siva Reddy

TL;DR

This work investigates how large language models process scope ambiguities: sentences in which multiple semantic operators with overlapping scope give rise to more than one interpretation. It introduces novel datasets totaling almost 1,000 scope-ambiguous sentences annotated with human judgments, and it runs two complementary experiments: a Q&A reading-preference task and a probabilistic continuation task that uses an ambiguity metric α. The results show that several models, including GPT-4 and GPT-3.5 variants as well as large Llama 2 models, align with human readings and, in some cases, identify the human-preferred interpretation with over 90% accuracy; correlations with human judgments are strongest for certain models (e.g., text-davinci-003 and Llama 2 13B). These findings suggest that modern LLMs can capture and use scope-related semantic structure, though methodological differences across studies explain disparate conclusions in the literature. The work provides a substantial resource for evaluating scope representation in LLMs and highlights the value of diverse evaluation methods for probing linguistic capabilities beyond surface performance.

Abstract

Sentences containing multiple semantic operators with overlapping scope often create ambiguities in interpretation, known as scope ambiguities. These ambiguities offer rich insights into the interaction between semantic structure and world knowledge in language processing. Despite this, there has been little research into how modern large language models treat them. In this paper, we investigate how different versions of certain autoregressive language models -- GPT-2, GPT-3/3.5, Llama 2 and GPT-4 -- treat scope-ambiguous sentences, and compare this with human judgments. We introduce novel datasets that together contain almost 1,000 unique scope-ambiguous sentences, covering interactions between a range of semantic operators and annotated for human judgments. Using these datasets, we find evidence that several models (i) are sensitive to the meaning ambiguity in these sentences, in a way that patterns well with human judgments, and (ii) can successfully identify human-preferred readings at a high level of accuracy (over 90% in some cases).

Paper Structure (31 sections, 4 figures, 7 tables)

This paper contains 31 sections, 4 figures, 7 tables.

Figures (4)

  • Figure 1: A high-level overview of our study, showing our approaches to our first (see Section \ref{sec:experiment1a}) and second (see Section \ref{sec:exp2a}) experiments.
  • Figure 2: An example of stimuli provided to models in Experiments 1A and 1B. The sections highlighted in bold are taken from our Experiment 1A dataset, and vary between individual stimuli presented to the models. The non-highlighted sections, which act as a prompt frame, remain fixed. For chat-optimized models, we solicit the model's response using the question highlighted in blue; for plain autoregressive models, we solicit the model's response by seeing what it predicts after the sequence highlighted in orange. In the control setting, the ambiguous sentence is dropped.
  • Figure 3: Experiment 2A and 2B set-up, comprising an ambiguous sentence $S$, an unambiguous control $S_{c}$, and two follow-ups, $F_{1}$ and $F_{2}$, demonstrated using an example from our manually constructed dataset. We compare the probabilities a model assigns to $F_{1}$ and $F_{2}$ as continuations to $S$, versus as continuations to $S_{c}$.
  • Figure 4: From Experiment 2A: scatterplot of $\alpha$-scores produced by text-davinci-003, against human proxy scores for the same datapoints.
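
The comparison described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes per-token log-probabilities for each follow-up have already been obtained from some autoregressive model; the function names and the two-way normalisation are hypothetical choices made for illustration. This is not the paper's definition of the $\alpha$ metric, which is not given in this section.

```python
import math
from typing import Sequence, Tuple


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Log-probability of a follow-up: the sum of its per-token log-probs."""
    return sum(token_logprobs)


def preference(logp_f1: float, logp_f2: float) -> float:
    """Share of probability on F1 when only F1 and F2 are considered.

    Hypothetical normalisation (an assumption, not taken from the paper).
    Subtracting the max keeps the exponentials numerically stable.
    """
    m = max(logp_f1, logp_f2)
    p1 = math.exp(logp_f1 - m)
    p2 = math.exp(logp_f2 - m)
    return p1 / (p1 + p2)


def preference_shift(after_s: Tuple[float, float],
                     after_sc: Tuple[float, float]) -> float:
    """Change in F1 preference after the ambiguous S versus the control S_c.

    Each argument is a pair (log P(F1 | context), log P(F2 | context)).
    """
    return preference(*after_s) - preference(*after_sc)


if __name__ == "__main__":
    # Made-up numbers, purely to show the call pattern.
    after_s = (math.log(0.2), math.log(0.2))    # S: no preference
    after_sc = (math.log(0.3), math.log(0.1))   # S_c: favours F1
    print(preference_shift(after_s, after_sc))
```

A value near zero would mean the two follow-ups are treated alike after $S$ and after $S_{c}$; a larger magnitude would mean the control shifts the model's preference between them.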