Correct-Detect: Balancing Performance and Ambiguity Through the Lens of Coreference Resolution in LLMs

Amber Shore, Russell Scheinberg, Ameeta Agrawal, So Young Lee

TL;DR

This work investigates how large language models handle referential ambiguity, focusing on coreference resolution under limited context. It introduces AmbiCoref and a Reflect prompt framework to compare LLMs with human judgments across unambiguous and ambiguous sentences, formalizing Correct-Unamb and Detect-Ambig metrics. A persistent trade-off emerges: models that excel at unambiguous resolution struggle to detect true ambiguity, and prompts that boost ambiguity detection often reduce unambiguous accuracy. GPT-4o generally outperforms Llama 3.1, but neither achieves a Pareto-optimal balance. Analyses also reveal biases (e.g., gender) and divergent human-model response patterns. The findings highlight incentive-driven challenges in current training and evaluation and point to future directions for creating ambiguity-aware, human-like processing in LLMs.
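To make the two metrics concrete, here is a minimal Python sketch of how Correct-Unamb and Detect-Ambig could be computed from labeled responses. It is an illustration under stated assumptions, not the authors' implementation. The `Item` record, the three-way response set ("A", "B", "?"), and the `dominates` helper are hypothetical. The Reflect setup described in the figures also includes graded answers (A?, B?), which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

AMBIGUOUS = "?"  # assumed label for an "ambiguous" response


@dataclass(frozen=True)
class Item:
    """One evaluated sentence (hypothetical record layout)."""
    is_ambiguous: bool    # True if the sentence belongs to the ambiguous set
    gold: Optional[str]   # "A" or "B" for unambiguous items, None otherwise
    response: str         # model or human answer: "A", "B", or "?"


def correct_unamb(items: Sequence[Item]) -> float:
    """Share of unambiguous items where the response matches the gold referent."""
    unamb = [it for it in items if not it.is_ambiguous]
    if not unamb:
        return float("nan")
    return sum(it.response == it.gold for it in unamb) / len(unamb)


def detect_ambig(items: Sequence[Item]) -> float:
    """Share of ambiguous items where the response flags ambiguity."""
    amb = [it for it in items if it.is_ambiguous]
    if not amb:
        return float("nan")
    return sum(it.response == AMBIGUOUS for it in amb) / len(amb)


def dominates(p: Tuple[float, float], q: Tuple[float, float]) -> bool:
    """Pareto dominance on (Correct-Unamb, Detect-Ambig) pairs:
    p is at least as good as q on both metrics and differs on at least one."""
    return p[0] >= q[0] and p[1] >= q[1] and p != q


# Toy responses, invented purely to exercise the functions.
toy = [
    Item(False, "A", "A"),
    Item(False, "B", "A"),
    Item(True, None, "?"),
    Item(True, None, "A"),
    Item(True, None, "?"),
    Item(True, None, "B"),
]
scores = (correct_unamb(toy), detect_ambig(toy))
```

Under this reading, a prompt configuration sits on the Pareto front when no other configuration dominates it. The trade-off described above corresponds to configurations that improve one score only by lowering the other.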

Abstract

Large Language Models (LLMs) are intended to reflect human linguistic competencies. But humans have access to a broad and embodied context, which is key in detecting and resolving linguistic ambiguities, even in isolated text spans. A foundational case of semantic ambiguity is found in the task of coreference resolution: how is a pronoun related to an earlier person mention? This capability is implicit in nearly every downstream task, and the presence of ambiguity at this level can alter performance significantly. We show that LLMs can achieve good performance with minimal prompting in both coreference disambiguation and the detection of ambiguity in coreference; however, they cannot do both at the same time. We present the CORRECT-DETECT trade-off: though models have both capabilities and deploy them implicitly, successful performance balancing these two abilities remains elusive.

Paper Structure

This paper contains 28 sections, 7 figures, 14 tables.

Figures (7)

  • Figure 1: Our results suggest a trade-off between coreference resolution accuracy in unambiguous sentences (Correct Unamb) and reliable ambiguity detection in ambiguous sentences (Detect Ambig).
  • Figure 2: Reflect: Model results compared to human answer patterns across unambiguous and ambiguous sets. Human responses exhibit a U-shaped distribution for unambiguous sentences, with a decreased preference for 'ambiguous' and 'probably' answers (?, A?, B?). This pattern reverses for ambiguous sentences, forming an inverted U-shaped distribution due to an increased preference for the 'ambiguous' (?) responses. The models do not strongly track this pattern: Llama 3.1 fails to adjust in the ambiguous set, while GPT-4o slightly increases '?' responses.
  • Figure 3: Detect-Ambig plotted against Correct-Unamb. 0: Basic, 1: Ambi-Ask, 2: Ambi-Stop, 3: Ambi-Wait, 4: Ambi-Doubt, 5: Ambi-CoT. H shows the human results for near-correctness.
  • Figure 4: Reflect: Unambiguous Llama 3.1 70b results. Dark blue are model results, light blue are averaged human judgments.
  • Figure 5: Reflect: Ambiguous Llama 3.1 70b results. Dark blue are model results, light blue are averaged human judgments.
  • ...and 2 more figures