Hard Negatives, Hard Lessons: Revisiting Training Data Quality for Robust Information Retrieval with LLMs

Nandan Thakur, Crystina Zhang, Xueguang Ma, Jimmy Lin

TL;DR

This work challenges the conventional belief that more training data always improves information retrieval effectiveness. It introduces RLHN, a cascaded LLM framework that identifies and relabels false negatives in large IR training datasets, coupled with three data-modification strategies. The authors show that pruning 8 of the 15 datasets in the BGE collection shrinks the training set by 2.35$\times$ while raising nDCG@10 on BEIR by 1.0 point, and that relabeling false negatives with RLHN improves both retrievers and rerankers on BEIR and AIR-Bench, with human validation supporting the LLM judgments. Overall, the paper demonstrates that data quality, rather than sheer quantity, is crucial for robust IR with LLMs, and provides a practical pipeline with public data and code for broader adoption.

Abstract

Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness: pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35$\times$ and, surprisingly, increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We utilize LLMs as a simple, cost-effective approach to identify and relabel false negatives in training datasets. Experimental results show that relabeling false negatives as true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7–1.4 points on BEIR and by 1.7–1.8 points at nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of LLMs in identifying false negatives is supported by human annotation results. Our training dataset and code are publicly available.
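As a quick sanity check on the pruning figures, the sketch below recomputes the reduction factor from the two training-set sizes reported for the BGE collection: 1.6M pairs before pruning and 680K pairs after (the latter is taken from the Figure 4 caption). Both sizes are rounded values, and the variable names are illustrative only.

```python
# Rounded training-set sizes as reported in the abstract and the Figure 4 caption.
pairs_before = 1_600_000   # full BGE training collection
pairs_after = 680_000      # seven datasets remaining after pruning

# Reduction factor: how many times smaller the pruned set is.
reduction_factor = pairs_before / pairs_after

# Fraction of training pairs removed by pruning.
fraction_removed = 1 - pairs_after / pairs_before

print(f"reduction factor: {reduction_factor:.2f}x")
print(f"pairs removed:    {fraction_removed:.1%}")
```

With these rounded sizes, the factor agrees with the 2.35$\times$ quoted in the abstract, and a little over half of the training pairs are removed.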

Paper Structure

This paper contains 44 sections, 11 figures, and 12 tables.

Figures (11)

  • Figure 1: Example of a training instance (query, ground truth positives, and unlabeled hard negatives) with detected false negatives taken from HotpotQA. The false negative passage (Splash Works) is mislabeled, because it is relevant to answering the user's query. The parts of the text that are useful in answering the query are highlighted in blue.
  • Figure 2: Flowchart for RLHN (ReLabeling Hard Negatives): (1) Provide the query, ground-truth or positive passages, and hard negative passages from a training instance as input, (2) Prompt a cost-effective LLM judge (e.g., GPT-4o-mini) and evaluate whether any hard negative is misclassified, (3) If yes, repeat the prompt with an accurate LLM judge (e.g., GPT-4o), and (4) Output the relabeled hard negative passages (those found relevant), which are either removed or relabeled as ground-truth passages in our experiments.
  • Figure 3: The distribution of training pairs (with at least one false negative) by the number of false hard negatives detected. 58% of the detected training pairs contain a single false negative, 19% contain two false negatives, and so on.
  • Figure 4: Dataset pruning by leaving one dataset out while fine-tuning E5 (base) on the BGE training collection. [ALL] denotes fine-tuning on all datasets with 1.6M training pairs; [7 Pruned] denotes fine-tuning on the 680K training pairs from the seven datasets that remain after dataset pruning (57.5% of the pairs removed). [Better than ALL] denotes that results improved after removing the dataset, meaning it has a negative impact on training. [Worse than ALL] denotes the opposite, where the dataset has a positive impact on training.
  • Figure 5: nDCG@10 scores on BEIR (Avg. 16 and Avg. 7) and AIR-Bench (Avg. 5) by fine-tuning E5 (base) on a subset of the 100K, 250K, 400K, and 680K training pairs using the "RLHN" technique for both stages. All individual dataset scores for both BEIR and AIR-Bench are provided in \ref{fig:all-scores-beir} and \ref{fig:all-scores-air-bench}.
  • ...and 6 more figures
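
The cascade described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration and not the authors' released code: `TrainingInstance`, `detect_false_negatives`, and `apply_strategy` are hypothetical names, and the two judges are keyword-overlap stand-ins for the LLM calls (the caption gives GPT-4o-mini and GPT-4o as examples). The sketch assumes that each judge returns the indices of the hard negatives it considers relevant and that the second judge's decision is final.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TrainingInstance:
    query: str
    positives: List[str]
    hard_negatives: List[str]

# Hypothetical judge interface: returns indices of hard negatives judged relevant.
Judge = Callable[[str, List[str], List[str]], List[int]]

def _tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))

def cheap_judge(query: str, positives: List[str], negatives: List[str]) -> List[int]:
    # Stand-in for the cost-effective LLM judge: flags any lexical overlap.
    return [i for i, p in enumerate(negatives) if _tokens(query) & _tokens(p)]

def strong_judge(query: str, positives: List[str], negatives: List[str]) -> List[int]:
    # Stand-in for the more accurate LLM judge: requires every query term.
    return [i for i, p in enumerate(negatives) if _tokens(query) <= _tokens(p)]

def detect_false_negatives(inst: TrainingInstance, first: Judge, second: Judge) -> List[int]:
    # Steps 1-2: ask the cheap judge whether any hard negative looks relevant.
    if not first(inst.query, inst.positives, inst.hard_negatives):
        return []
    # Step 3: only flagged instances are re-judged by the stronger judge.
    return sorted(set(second(inst.query, inst.positives, inst.hard_negatives)))

def apply_strategy(inst: TrainingInstance, false_negatives: List[int], strategy: str) -> TrainingInstance:
    # Step 4: either drop the detected passages or move them to the positives.
    flagged = set(false_negatives)
    kept = [p for i, p in enumerate(inst.hard_negatives) if i not in flagged]
    moved = [p for i, p in enumerate(inst.hard_negatives) if i in flagged]
    if strategy == "remove":
        return TrainingInstance(inst.query, list(inst.positives), kept)
    if strategy == "relabel":
        return TrainingInstance(inst.query, inst.positives + moved, kept)
    raise ValueError(f"unknown strategy: {strategy}")

if __name__ == "__main__":
    # Toy data for illustration only; not drawn from the paper's datasets.
    example = TrainingInstance(
        query="capital of france",
        positives=["Paris is the capital of France."],
        hard_negatives=[
            "Paris is the capital and largest city of France.",
            "Berlin is the capital of Germany.",
            "Bananas are yellow.",
        ],
    )
    found = detect_false_negatives(example, cheap_judge, strong_judge)
    print(found)
    print(apply_strategy(example, found, "relabel"))
```

Running the cheap judge first means the more expensive judge is only called on instances that were already flagged, which is the cost-saving idea behind the cascade. In practice, each stand-in would be replaced by a prompt to the corresponding LLM.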