Towards Robust Ranker for Text Retrieval

Yucheng Zhou, Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Guodong Long, Binxing Jiao, Daxin Jiang

TL;DR

This work tackles the robustness of text rankers in retrieval–rerank pipelines by addressing two key issues: label noise from strong retrievers and suboptimal negative sampling. It proposes R$^2$anker, a framework that uses multiple retrievers as negative generators to create open-set and diverse hard negatives, guided by a joint adversarial-like training objective and open-set noise strategies. Empirical results on MS MARCO demonstrate state-of-the-art performance for BM25-reranking and full-ranking, and the ranker can be distilled into a competitive first-stage retriever, enabling efficient end-to-end improvements. Distribution analyses further show that diversity and distribution alignment of negatives are crucial for robust ranker training, underscoring the practical value of multi-generator negative sampling in large-scale IR.

Abstract

A ranker plays an indispensable role in the de facto 'retrieval & rerank' pipeline, but its training still lags behind: it learns from moderate negatives and/or serves as an auxiliary module for a retriever. In this work, we first identify two major barriers to a robust ranker, i.e., inherent label noise caused by a well-trained retriever and non-ideal negatives sampled for a highly capable ranker. We therefore propose using multiple retrievers as negative generators to improve the ranker's robustness, where i) involving extensive out-of-distribution label noise makes the ranker robust against each noise distribution, and ii) diverse hard negatives from a joint distribution are relatively close to the ranker's negative distribution, leading to more challenging and thus more effective training. To evaluate our robust ranker (dubbed R$^2$anker), we conduct experiments in various settings on the popular passage retrieval benchmark, including BM25-reranking, full-ranking, retriever distillation, etc. The empirical results verify the new state-of-the-art effectiveness of our model.
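
As a rough illustration of the "multiple retrievers as negative generators" idea, the following minimal sketch pools the top-ranked passages of several retrievers and draws negatives from the union. The function name `sample_negatives`, the retriever names, and the uniform sampling over the deduplicated pool are assumptions made for this example; the abstract does not specify how the joint distribution is formed, so this should not be read as the paper's actual sampling scheme.

```python
import random


def sample_negatives(ranked_lists, positives, num_negatives, seed=0):
    """Sample negatives from the union of several retrievers' top-ranked passages.

    ranked_lists: dict mapping a (hypothetical) retriever name to a list of passage ids.
    positives: set of passage ids labeled relevant for the query.
    Uniform sampling over the deduplicated pool is an assumption of this sketch.
    """
    rng = random.Random(seed)
    pool = []
    seen = set()
    for passage_ids in ranked_lists.values():
        for pid in passage_ids:
            # Skip labeled positives and passages already contributed by another retriever.
            if pid in positives or pid in seen:
                continue
            seen.add(pid)
            pool.append(pid)
    # Never request more negatives than the pool contains.
    k = min(num_negatives, len(pool))
    return rng.sample(pool, k)


# Hypothetical top-k lists for one query from three retrievers.
ranked_lists = {
    "bm25": ["p1", "p2", "p3"],
    "dense": ["p2", "p4", "p5"],
    "lexicon": ["p1", "p6"],
}
negatives = sample_negatives(ranked_lists, positives={"p2"}, num_negatives=4)
```

Because the pool is a union across retrievers, a negative can come from any of them, which is one simple way to obtain more diverse hard negatives than a single retriever provides. Unlabeled passages in the pool may still be relevant (false negatives), which is the label-noise issue the abstract raises.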

Paper Structure (24 sections, 12 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: BM25-reranking performance of rankers trained on different negative distributions from specific retrievers (left), and false negative labels introduced by two well-trained strong retrievers in contrast to the BM25 retriever (right). Here, 'R1' denotes a well-trained coCondenser [Gao22Unsupervised] dense-vector retriever, whereas 'R2' denotes a well-trained SPLADE [Formal21SPLADEv2] lexicon-weighting retriever.
  • Figure 2: BM25-reranking performance of various rankers trained on negatives sampled from retrievers' (joint) distributions. D1, D2, L1, and L2 are abbreviations for the Den-BN, Den-HN, Lex-BN, and Lex-HN retrievers, respectively. 'KL divergence' denotes the difference between the retrievers' (joint) distribution and the BM25 retriever's, i.e., $\mathrm{KL}\big(P(\cdot|q; \Theta^{\text{(be)}}) \,\|\, \mathrm{BM25}(\cdot|q; \mathcal{D})\big)$, which is used to characterize the negatives' distribution. For example, the point 'bm25,D2,L2' denotes that i) the KL divergence between its joint retriever's distribution and the BM25 retriever's distribution is around 0.4, and ii) a ranker trained on that joint negative distribution achieves 41.1 MRR@10 on BM25 reranking.
  • Figure 3: BM25-reranking performance of various rankers vs. the change in relevance score distribution from the (joint) retriever to the trained ranker. Formally, $\Delta = \mathrm{KL}\big(P(\cdot|q; \Theta^{\text{(be)}}) \,\|\, \mathrm{BM25}(\cdot|q; \mathcal{D})\big) - \mathrm{KL}\big(P(\cdot|q; \theta^{\text{(ce)}}) \,\|\, \mathrm{BM25}(\cdot|q; \mathcal{D})\big)$, where $\theta^{\text{(ce)}}$ is trained with a specific $\Theta^{\text{(be)}}$. The smaller $|\Delta|$ is, the closer the negative sampling distribution is to the ideal negative distribution of the ranker, since learning on the retriever-sampled negatives then does not shift the distribution.
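
The KL divergence and the quantity $\Delta$ used in the captions of Figures 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions: each model's distribution over a shared candidate list is obtained by a softmax over its relevance scores (the captions do not say how the distributions are normalized), and the helper names `softmax`, `kl_divergence`, and `delta` as well as the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw relevance scores into a probability distribution (an assumption of this sketch)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same candidates."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def delta(retriever_scores, ranker_scores, bm25_scores):
    """Delta = KL(retriever || BM25) - KL(ranker || BM25), following the Figure 3 caption."""
    bm25 = softmax(bm25_scores)
    kl_retriever = kl_divergence(softmax(retriever_scores), bm25)
    kl_ranker = kl_divergence(softmax(ranker_scores), bm25)
    return kl_retriever - kl_ranker


# Hypothetical scores for the same four candidate passages of one query.
bm25_scores = [2.0, 1.5, 1.0, 0.5]
retriever_scores = [3.0, 1.0, 0.5, 0.2]
ranker_scores = [3.2, 0.9, 0.4, 0.1]
d = delta(retriever_scores, ranker_scores, bm25_scores)
```

In this reading, a value of `d` near zero means the trained ranker's score distribution sits about as far from BM25 as the retriever's did, which is the situation the Figure 3 caption associates with a well-matched negative sampling distribution.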