
NevIR: Negation in Neural Information Retrieval

Orion Weller, Dawn Lawrie, Benjamin Van Durme

TL;DR

NevIR introduces a targeted benchmark for evaluating how neural information retrieval models handle negation. It pairs minimally different documents that differ only in negation and uses crowdsourced queries to test whether models rank the documents correctly for both queries. Across model families, cross-encoders perform best but still struggle, and many architectures perform at or below a random baseline; ColBERT shows intermediate results and reveals issues in the MaxSim operator. Negation-specific fine-tuning yields meaningful gains, but the fine-tuned models remain far from human performance, and first-stage retrieval remains a bottleneck. NevIR provides a contrastive dataset for training and evaluation to spur development of negation-aware IR systems, with implications for safer and more reliable search in both high-stakes and everyday scenarios.
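For readers unfamiliar with the MaxSim operator mentioned above, the following is a minimal sketch of late-interaction scoring as used in ColBERT-style models: each query token takes its maximum similarity over all document tokens, and these maxima are summed. The function names (`dot`, `maxsim_score`) and the toy two-dimensional vectors are illustrative assumptions, not code or data from the paper. The example also shows one way such an operator can be insensitive to an added token when embeddings are held fixed; real models use contextual embeddings, so this is only an illustration and not the paper's analysis.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best-matching document token similarity."""
    if not doc_vecs:
        raise ValueError("doc_vecs must contain at least one token vector")
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy, hand-made token embeddings (hypothetical values for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.0, 1.0]]
# Same document plus one extra token standing in for "not".
doc_with_not = doc + [[-1.0, -1.0]]

score_plain = maxsim_score(query, doc)
score_negated = maxsim_score(query, doc_with_not)
```

In this toy setting both documents receive the same score, because the extra token is never the best match for any query token and therefore never enters the sum.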

Abstract

Negation is a common everyday phenomenon and has been a consistent area of weakness for language models (LMs). Although the Information Retrieval (IR) community has adopted LMs as the backbone of modern IR architectures, there has been little to no research on how negation impacts neural IR. We therefore construct a straightforward benchmark on this theme: asking IR models to rank two documents that differ only by negation. We show that the results vary widely according to the type of IR architecture: cross-encoders perform best, followed by late-interaction models, and in last place are bi-encoder and sparse neural architectures. We find that most information retrieval models (including SOTA ones) do not consider negation, performing the same as or worse than a random ranking. We show that although the obvious approach of continued fine-tuning on a dataset of contrastive documents containing negations increases performance (as does model size), there is still a large gap between machine and human performance.

Paper Structure (38 sections, 12 figures, 2 tables)

Figures (12)

  • Figure 1: Negation is something not well understood by IR systems. This screenshot shows Google Search making a deadly recommendation because of its failure to catch the negation in the article (e.g. "do not ...").
  • Figure 2: An example instance and the evaluation process. The initial documents from CondaQA (Ravichander et al., 2022) are used to create the queries via Mechanical Turk. The lower half shows the pairwise accuracy evaluation process, where the model must rank the documents correctly for both queries. In this example, the IR model scored zero paired accuracy, ranking Doc #1 above Doc #2 for both queries (and failing to take the negation into account).
  • Figure 3: The distribution of the number of different (i.e. unique) words between the queries (left) or documents (right) in each pair. The average length differences are shown in Table \ref{tab:statistics}.
  • Figure 4: Error analysis of the model predictions, detailing whether models preferred (i.e. ranked first for both queries) the document with negation (green), preferred the edited non-negation document (orange), or predicted the reversed ranking for both queries (blue). Models that performed better generally preferred negation documents when they made incorrect predictions, while bi-encoder models were more balanced in their errors.
  • Figure 5: How fine-tuning on NevIR's training set affects results on NevIR and MSMarco: the upper panel shows NevIR's pairwise accuracy scores on the test set while training for up to 20 epochs, and the lower panel shows MSMarco dev MRR@10 scores. For QNLI-electra-base see Appendix \ref{app:qnli}.
  • ...and 7 more figures
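
The pairwise accuracy evaluation described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the `Pair` layout, the `pairwise_accuracy` name, and the `score(query, doc)` callable are hypothetical, and ties are counted as incorrect by assumption. An instance counts as correct only when each of its two queries ranks its own document above the other one.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical layout: (query_1, query_2, doc_1, doc_2), where query_i
# should rank doc_i first. The two documents differ only by negation.
Pair = Tuple[str, str, str, str]


def pairwise_accuracy(
    pairs: Sequence[Pair], score: Callable[[str, str], float]
) -> float:
    """Fraction of instances where BOTH queries rank their document first."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = 0
    for q1, q2, d1, d2 in pairs:
        q1_ok = score(q1, d1) > score(q1, d2)  # ties count as wrong (assumption)
        q2_ok = score(q2, d2) > score(q2, d1)
        if q1_ok and q2_ok:
            correct += 1
    return correct / len(pairs)


# Toy data with placeholder strings rather than real benchmark text.
toy_pairs = [("q1", "q2", "d1", "d2")]

# A scorer that always prefers the first document, regardless of the query,
# mirrors the failure in Figure 2 and gets zero paired accuracy.
always_doc1 = lambda q, d: 1.0 if d == "d1" else 0.0
# An oracle scorer that matches each query to its own document.
oracle = lambda q, d: 1.0 if q[1:] == d[1:] else 0.0

acc_ignoring_negation = pairwise_accuracy(toy_pairs, always_doc1)
acc_oracle = pairwise_accuracy(toy_pairs, oracle)
```

Under this definition, a model that ranks each query's two documents uniformly at random and independently gets both right with probability 1/2 × 1/2 = 1/4, which is why a model that ignores negation can score below a random ranking.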