NevIR: Negation in Neural Information Retrieval
Orion Weller, Dawn Lawrie, Benjamin Van Durme
TL;DR
NevIR introduces a targeted benchmark for evaluating how neural information retrieval models handle negation. It pairs minimally different documents that differ only in negation and crowdsources queries to test whether a model ranks the documents correctly for both. Across model families, cross-encoders perform best but still struggle, and many architectures perform at or below a random baseline; ColBERT shows intermediate results and reveals issues in the MaxSim operator. Negation-specific fine-tuning yields meaningful gains, but performance remains far from human level, and first-stage retrieval remains a bottleneck. NevIR provides a contrastive dataset for training and evaluation to spur development of negation-aware IR systems, with implications for safer and more reliable search in high-stakes and everyday scenarios.
Abstract
Negation is a common everyday phenomenon and has been a consistent area of weakness for language models (LMs). Although the Information Retrieval (IR) community has adopted LMs as the backbone of modern IR architectures, there has been little to no research into how negation impacts neural IR. We therefore construct a straightforward benchmark on this theme: asking IR models to rank two documents that differ only by negation. We show that the results vary widely according to the type of IR architecture: cross-encoders perform best, followed by late-interaction models, and in last place are bi-encoder and sparse neural architectures. We find that most information retrieval models (including SOTA ones) do not consider negation, performing the same as or worse than a random ranking. We show that although the obvious approach of continued fine-tuning on a dataset of contrastive documents containing negations increases performance (as does model size), there is still a large gap between machine and human performance.
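The ranking task described above can be illustrated with a minimal sketch of a strict pairwise evaluation. This is not the authors' code. It assumes each contrastive pair comes with one query per document and that a pair counts as correct only when both queries rank their matching document first. The names `ContrastivePair`, `pairwise_accuracy`, and `overlap_score`, as well as the example texts, are hypothetical and chosen for illustration.

```python
from typing import Callable, Iterable, NamedTuple


class ContrastivePair(NamedTuple):
    """Two documents differing only by negation, plus one query per document (assumed layout)."""
    q1: str    # query that should prefer doc1
    q2: str    # query that should prefer doc2
    doc1: str
    doc2: str


def pairwise_accuracy(
    pairs: Iterable[ContrastivePair],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where BOTH queries rank their matching document strictly higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    correct = 0
    for p in pairs:
        q1_ok = score(p.q1, p.doc1) > score(p.q1, p.doc2)
        q2_ok = score(p.q2, p.doc2) > score(p.q2, p.doc1)
        if q1_ok and q2_ok:
            correct += 1
    return correct / len(pairs)


def overlap_score(query: str, doc: str) -> float:
    """Toy lexical scorer (an illustrative stand-in, not a real IR model): shared-token count."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


example = ContrastivePair(
    q1="which drug can children take",
    q2="which drug can children not take",
    doc1="the drug is approved for children",
    doc2="the drug is not approved for children",
)

# The toy scorer ties on q1 (it cannot tell the documents apart), so the pair fails.
print(pairwise_accuracy([example], overlap_score))
```

Under this strict criterion, a scorer that guesses independently for each query would get both right about one time in four, which gives a sense of why a model that ignores negation can land at or below a random ranking. Any real model (cross-encoder, bi-encoder, late-interaction, or sparse) could be plugged in through the `score` callable.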
