Scope Ambiguities in Large Language Models
Gaurav Kamath, Sebastian Schuster, Sowmya Vajjala, Siva Reddy
TL;DR
This work investigates how large language models process scope ambiguities: sentences in which multiple semantic operators with overlapping scope allow more than one interpretation. It introduces novel datasets totaling almost 1,000 scope-ambiguous sentences annotated with human judgments, and runs two complementary experiments: a Q&A reading-preference task and a probabilistic continuation task that uses an ambiguity metric α. The results show that several models, including GPT-4, GPT-3.5 variants, and large Llama 2 models, align with human readings and, in some cases, identify the preferred interpretation with high accuracy; correlations with human judgments are strongest for certain models (e.g., text-davinci-003 and Llama 2 13B). These findings suggest that modern LLMs can capture and use scope-related semantic structure, and that methodological differences across studies explain the disparate conclusions in the literature. The work provides a substantial resource for evaluating scope representation in LLMs and highlights the value of diverse evaluation methods for probing linguistic capabilities beyond surface performance.
Abstract
Sentences containing multiple semantic operators with overlapping scope often create ambiguities in interpretation, known as scope ambiguities. These ambiguities offer rich insights into the interaction between semantic structure and world knowledge in language processing. Despite this, there has been little research into how modern large language models treat them. In this paper, we investigate how different versions of certain autoregressive language models (GPT-2, GPT-3/3.5, Llama 2, and GPT-4) treat scope-ambiguous sentences, and compare this with human judgments. We introduce novel datasets that together contain almost 1,000 unique scope-ambiguous sentences, covering interactions between a range of semantic operators and annotated with human judgments. Using these datasets, we find evidence that several models (i) are sensitive to the meaning ambiguity in these sentences, in a way that patterns well with human judgments, and (ii) can successfully identify human-preferred readings at a high level of accuracy (over 90% in some cases).
