Reproducing NevIR: Negation in Neural Information Retrieval

Coen van den Elsen, Francien Barkhof, Thijmen Nijdam, Simon Lupart, Mohammad Aliannejadi

TL;DR

This work reproduces NevIR and extends its evaluation to contemporary IR architectures, including listwise LLM re-rankers, while using the ExcluIR benchmark to probe how well negation understanding generalizes. Newer models handle negation better, but performance remains below human levels; listwise LLM re-rankers offer the strongest gains, at high computational cost. Fine-tuning on negation data can boost NevIR performance but risks overfitting and degraded performance on general ranking tasks such as MS MARCO. Cross-encoder models transfer better between NevIR and ExcluIR than the other model families. The study highlights dataset-specific negation patterns, suggests a trade-off-based early stopping method to mitigate overfitting, and provides reproducible guidance for evaluating negation in IR across emerging model families.

Abstract

Negation is a fundamental aspect of human communication, yet it remains a challenge for Language Models (LMs) in Information Retrieval (IR). Despite the heavy reliance of modern neural IR systems on LMs, little attention has been given to their handling of negation. In this study, we reproduce and extend the findings of NevIR, a benchmark study that revealed most IR models perform at or below the level of random ranking when dealing with negation. We replicate NevIR's original experiments and evaluate newly developed state-of-the-art IR models. Our findings show that a recently emerging category, listwise Large Language Model (LLM) re-rankers, outperforms other models but still falls short of human performance. Additionally, we leverage ExcluIR, a benchmark dataset designed for exclusionary queries with extensive negation, to assess the generalisability of negation understanding. Our findings suggest that fine-tuning on one dataset does not reliably improve performance on the other, indicating notable differences in their data distributions. Furthermore, we observe that only cross-encoders and listwise LLM re-rankers achieve reasonable performance across both negation tasks.

Paper Structure

This paper contains 14 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: This example illustrates how misinterpreting a negation in a medical context could cause someone to misuse a critical emergency medication like epinephrine.
  • Figure 2: An example NevIR instance and the pairwise evaluation process for a contrastive query-document pair. This instance is classified as incorrect because only one of the two queries is ranked correctly, indicating that the ranker fails to account for negation.
  • Figure 3: Fine-tuning results on NevIR. The top plot shows pairwise accuracy on NevIR, while the bottom plot presents MRR@10 on MS MARCO.
  • Figure 4: Model performance on ExcluIR and NevIR for the four model families. Models were fine-tuned on each dataset separately and on the merged dataset.
  • Figure 5: An MS MARCO query with the rankings by the multi-qa-mpnet model and its fine-tuned variant on NevIR.
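
The two metrics named in the captions for Figures 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names, the (q1, q2, doc1, doc2) tuple layout, and the scorer interface are all assumptions made for the example. It follows the rule stated in the Figure 2 caption, where an instance counts as correct only if both queries rank their matching document first, and the standard definition of MRR@10.

```python
from typing import Callable, Sequence, Set, Tuple

# Hypothetical interface: any function mapping (query, document) to a relevance score.
Scorer = Callable[[str, str], float]
# Assumed layout of a contrastive instance: q1 matches doc1, q2 matches doc2.
Instance = Tuple[str, str, str, str]


def pair_is_correct(score: Scorer, q1: str, q2: str, doc1: str, doc2: str) -> bool:
    """True only if BOTH queries rank their own document above the other one."""
    q1_ok = score(q1, doc1) > score(q1, doc2)
    q2_ok = score(q2, doc2) > score(q2, doc1)
    return q1_ok and q2_ok


def pairwise_accuracy(score: Scorer, instances: Sequence[Instance]) -> float:
    """Fraction of contrastive instances where both rankings are correct."""
    if not instances:
        return 0.0
    correct = sum(pair_is_correct(score, *inst) for inst in instances)
    return correct / len(instances)


def mrr_at_k(rankings: Sequence[Sequence[str]],
             relevant: Sequence[Set[str]],
             k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant document within the top k.

    rankings[i] is the ranked list of document ids for query i;
    relevant[i] is the set of relevant document ids for that query.
    A query with no relevant document in the top k contributes 0.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)
```

Under this pairwise rule, a ranker that ignores negation and always prefers the same document gets one query right and the other wrong, so it scores 0 on that instance. A ranker that guesses each query independently with probability 1/2 would be expected to get both right only a quarter of the time, which is why pairwise accuracy is a stricter measure than per-query accuracy.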