Evaluating D-MERIT of Partial-annotation on Information Retrieval

Royi Rassin, Yaron Fairstein, Oren Kalinsky, Guy Kushilevitz, Nachshon Cohen, Alexander Libov, Yoav Goldberg

TL;DR

Evaluating on a dataset that annotates only a subset of the relevant passages can produce a misleading ranking of retrieval systems; as more relevant texts are added to the evaluation set, the rankings converge.

Abstract

Retrieval models are often evaluated on partially-annotated datasets. Each query is mapped to a few relevant texts and the remaining corpus is assumed to be irrelevant. As a result, models that successfully retrieve false negatives are punished in evaluation. Unfortunately, completely annotating all texts for every query is not resource efficient. In this work, we show that using partially-annotated datasets in evaluation can paint a distorted picture. We curate D-MERIT, a passage retrieval evaluation set from Wikipedia, aspiring to contain all relevant passages for each query. Queries describe a group (e.g., "journals about linguistics") and relevant passages are evidence that entities belong to the group (e.g., a passage indicating that "Language" is a journal about linguistics). We show that evaluating on a dataset containing annotations for only a subset of the relevant passages might result in misleading ranking of the retrieval systems and that as more relevant texts are included in the evaluation set, the rankings converge. We propose our dataset as a resource for evaluation and our study as a recommendation for balance between resource-efficiency and reliable evaluation when annotating evaluation sets for text retrieval.
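
The effect described in the abstract can be illustrated with a minimal sketch. The toy passages, the two systems, and the helper names (`recall_at_k`, `rank_systems`) are hypothetical and are not taken from D-MERIT; the sketch only shows how a system that retrieves unlabeled relevant passages (false negatives) is penalized under partial annotation, and how the ranking flips once all relevant passages are annotated.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the annotated relevant passages found in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


# Hypothetical toy data: one query with four truly relevant passages.
full_annotations = {"p1", "p2", "p3", "p4"}
# Partial annotation: only one of the relevant passages was labeled.
partial_annotations = {"p1"}

# Top-3 results of two hypothetical systems.
system_a = ["p1", "x1", "x2"]  # retrieves only the labeled passage
system_b = ["p2", "p3", "p4"]  # retrieves three unlabeled but relevant passages


def rank_systems(annotations, k=3):
    """Score both systems against the given annotations and rank them, best first."""
    scores = {
        "A": recall_at_k(system_a, annotations, k),
        "B": recall_at_k(system_b, annotations, k),
    }
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


partial_ranking, partial_scores = rank_systems(partial_annotations)
full_ranking, full_scores = rank_systems(full_annotations)

print("partial:", partial_ranking, partial_scores)
print("full:   ", full_ranking, full_scores)
```

Under the partial annotations, system A scores 1.0 and system B scores 0.0; under the full annotations, A drops to 0.25 and B rises to 0.75, so the ranking is reversed.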

Paper Structure

This paper contains 48 sections, 5 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Demonstrating the evidence retrieval task described in \ref{sec:evidence_retrieval_task}. The query is "Names of first world war camoufleurs". Highlighted text corresponds to the query requirements: names (green), "First World War" (red), and "camouflage" (orange). A passage must match all requirements to be considered as evidence.
  • Figure 2: Selection techniques for a single-relevant setting. The x-axis denotes systems used to select passages for annotation. Each tick represents the performance of systems on the same dataset with different annotations. An intersection demonstrates a swap in rankings.
  • Figure 3: Partial-Kendall-$\tau$ between rankings of systems with $k$ percent annotations and ranking with all evidence, using recall@20. System pairs are divided into 3 buckets as described in \ref{sec:buckets_exp}.
  • Figure 4: Concordance between rankings of systems with varying percentages of evidence and ranking with all evidence, using recall@5, recall@20, recall@50, and recall@100. System pairs are divided into 3 buckets as described in \ref{sec:buckets_exp}.
  • Figure 5: Fraction of relevant passages covered by top-10 passages for $s$ systems.
  • ...and 6 more figures
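
The rank-agreement comparison mentioned in the captions of Figures 3 and 4 can be sketched with a standard Kendall-$\tau$ (the tau-a variant) over system pairs. This is an illustrative assumption, not the paper's method: the bucketed "Partial-Kendall-$\tau$" is not reproduced here, and the system names and scores below are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_ref, scores_other):
    """Kendall tau-a between two score dictionaries over the same systems.

    A pair of systems is concordant if both score sets order it the same way,
    discordant if they order it oppositely; ties count as neither.
    """
    systems = list(scores_ref)
    concordant = discordant = 0
    for a, b in combinations(systems, 2):
        d_ref = scores_ref[a] - scores_ref[b]
        d_other = scores_other[a] - scores_other[b]
        if d_ref * d_other > 0:
            concordant += 1
        elif d_ref * d_other < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Hypothetical recall@20 scores for three systems.
full_scores = {"A": 0.60, "B": 0.50, "C": 0.40}     # all evidence annotated
partial_scores = {"A": 0.30, "B": 0.35, "C": 0.20}  # only some evidence annotated

tau = kendall_tau(full_scores, partial_scores)
print(f"Kendall tau between partial and full rankings: {tau:.3f}")
```

In this toy case one of the three system pairs (A, B) swaps order under partial annotation, giving a value of (2 - 1) / 3; a value of 1.0 would mean the partial and full annotations rank every pair identically.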