AmbigDocs: Reasoning across Documents on Different Entities under the Same Name

Yoonsang Lee, Xi Ye, Eunsol Choi

TL;DR

AmbigDocs is a large-scale synthetic benchmark for studying reasoning across documents when multiple entities share a surface name. Using Wikipedia disambiguation pages, it pairs each ambiguous question with a set of gold documents, each grounding a distinct disambiguated entity and its answer; a single question can have up to 10 answers. The authors define five answer categories (Complete, plus four incomplete types: Partial, No, Ambiguous, and Merged) and develop automatic heuristics for assigning them, alongside token-recall and Disambig-F1 metrics. Empirically, current LMs struggle to produce complete, disambiguated long-form responses, though in-domain few-shot prompting improves outcomes. The work lays groundwork for future research on multi-document reasoning under entity ambiguity and suggests applications such as fact verification and multi-document summarization.

Abstract

Different entities with the same name can be difficult to distinguish. Handling confusing entity mentions is a crucial skill for language models (LMs). For example, given the question "Where was Michael Jordan educated?" and a set of documents discussing different people named Michael Jordan, can LMs distinguish entity mentions to generate a cohesive answer to the question? To test this ability, we introduce a new benchmark, AmbigDocs. By leveraging Wikipedia's disambiguation pages, we identify sets of documents belonging to different entities that share an ambiguous name. From these documents, we generate questions containing an ambiguous name and their corresponding sets of answers. Our analysis reveals that current state-of-the-art models often yield ambiguous answers or incorrectly merge information belonging to different entities. We establish an ontology categorizing four types of incomplete answers and automatic evaluation metrics to identify such categories. We lay the foundation for future work on reasoning across multiple documents with ambiguous entities.

Paper Structure (49 sections, 13 figures, 11 tables)

Figures (13)

  • Figure 1: Given a question containing an ambiguous entity mention (e.g., Judge Day) and a set of documents, each containing a valid answer for a different disambiguated entity, an LLM should generate a complete answer $y$ that pairs each disambiguated entity with its answer. Bottom-left box: we first evaluate $y$ with respect to a single input document, checking whether $y$ contains the answer together with its disambiguated entity name (e.g., for Doc 1, it should mention the answer "Alabama" and the disambiguated entity name "Charles Bernard Day"). Based on the per-document scores, we assign one of the answer category labels (partial, no, ambiguous, merged, complete).
  • Figure 1: Distribution of the number of answers per question (#ans) in AmbigDocs. The maximum number was 10.
  • Figure 2: Overview of our dataset generation. We identify a surface name and a list of disambiguated entities from Wikipedia's disambiguation pages. We select two documents for generating a question and their corresponding answers. Subsequently, we gather additional answers from the remaining documents.
  • Figure 3: Treemap illustrating the distribution of "Wh" questions in AmbigDocs. Each box's size corresponds to its frequency. Questions starting with "What" are the most prevalent, reflecting the dataset's focus on entities.
  • Figure 4: Distribution of answer categories (% in each box) under the Gold Only setting. Different LMs show different failure modes.
  • ...and 8 more figures
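
The per-document check described in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' implementation: the tokenizer, the `threshold` parameter, the `GoldDoc` type, and the exact decision rules in `categorize` are all assumptions made for illustration, and the second gold document in the usage lines is a hypothetical placeholder. The idea is only to show how per-document answer and entity scores could be turned into one of the five category labels.

```python
import re
from dataclasses import dataclass


def tokens(text):
    """Lowercase alphanumeric tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_recall(reference, output):
    """Fraction of reference tokens that also appear in the output."""
    ref = tokens(reference)
    if not ref:
        return 0.0
    out = set(tokens(output))
    return sum(t in out for t in ref) / len(ref)


@dataclass
class GoldDoc:
    entity: str  # disambiguated entity name grounded by this document
    answer: str  # answer to the ambiguous question for this entity


def categorize(output, docs, threshold=1.0):
    """Assign a category label from per-document scores (illustrative rules only)."""
    has_answer = [token_recall(d.answer, output) >= threshold for d in docs]
    has_entity = [token_recall(d.entity, output) >= threshold for d in docs]
    n_answers = sum(has_answer)
    n_pairs = sum(a and e for a, e in zip(has_answer, has_entity))

    if n_answers == 0:
        return "no"
    if n_pairs == len(docs):
        return "complete"
    if n_pairs == 0:
        # Answers are present, but none is tied to a disambiguated entity name.
        return "merged" if n_answers >= 2 else "ambiguous"
    return "partial"


docs = [
    GoldDoc("Charles Bernard Day", "Alabama"),   # from the Figure 1 caption
    GoldDoc("Jane Placeholder Day", "Ohio"),     # hypothetical second entity
]
print(categorize("Judge Day served in Alabama.", docs))
print(categorize("Judge Day served in Alabama and Ohio.", docs))
```

Under these assumed rules, an output that names an answer without its disambiguated entity falls into the ambiguous category, and one that lists several answers under the bare surface name falls into merged. A real heuristic would need softer matching than exact token overlap, for example to handle paraphrased answers or abbreviated entity names.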