Correct-Detect: Balancing Performance and Ambiguity Through the Lens of Coreference Resolution in LLMs

Amber Shore; Russell Scheinberg; Ameeta Agrawal; So Young Lee

TL;DR

This work investigates how large language models (LLMs) handle referential ambiguity, focusing on coreference resolution under limited context. Using AmbiCoref and a Reflect prompt framework, it compares LLMs with human judgments on unambiguous and ambiguous sentences and formalizes two metrics, Correct-Unamb and Detect-Ambig. A persistent trade-off emerges: models that excel at resolving unambiguous sentences struggle to detect true ambiguity, and prompts that boost ambiguity detection often reduce accuracy on unambiguous sentences. GPT-4o generally outperforms Llama 3.1, but neither achieves a Pareto-optimal balance. The analyses also reveal biases (e.g., gender) and response patterns that diverge between humans and models. The findings highlight incentive-driven challenges in current training and evaluation and point to future directions for ambiguity-aware, human-like processing in LLMs.

Abstract

Large Language Models (LLMs) are intended to reflect human linguistic competencies. Humans, however, have access to a broad and embodied context, which is key to detecting and resolving linguistic ambiguities, even in isolated text spans. A foundational case of semantic ambiguity is found in the task of coreference resolution: how is a pronoun related to an earlier person mention? This capability is implicit in nearly every downstream task, and the presence of ambiguity at this level can alter performance significantly. We show that LLMs can achieve good performance with minimal prompting in both coreference disambiguation and the detection of ambiguity in coreference; however, they cannot do both at the same time. We present the CORRECT-DETECT trade-off: though models have both capabilities and deploy them implicitly, balancing the two successfully remains elusive.
