Towards Trustworthy Reranking: A Simple yet Effective Abstention Mechanism

Hippolyte Gisserot-Boukhlef; Manuel Faysse; Emmanuel Malherbe; Céline Hudelot; Pierre Colombo

TL;DR

This work tackles trustworthy neural information retrieval by enabling abstention in the reranking stage under black-box constraints. It introduces two confidence-estimation paradigms: a reference-free approach using simple score statistics and a data-driven approach using a reference set to calibrate a regression-based confidence function, with a threshold-based abstention decision. The standout result is that a reference-based linear regression confidence ($u_{\text{lin}}$) consistently outperforms reference-free baselines across six multilingual datasets, achieving higher $nAUC_m$ while incurring negligible overhead ($\approx 1.2\%$ of relevance-score computation time). Domain transfer experiments show calibration depends on distributional similarity and the number of positives per instance, but small calibration sets can suffice. Overall, the proposed abstention mechanism offers a practical, training-light method to enhance trustworthiness and efficiency of IR pipelines such as RAG and search systems.
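The two confidence paradigms and the threshold rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released implementation: the function names (`u_max`, `u_std`, `fit_u_lin`, `u_lin`, `should_abstain`), the choice to sort scores in decreasing order before the regression, the ordinary least-squares fit, and the quantile-based threshold are all hypothetical choices made for this example, and the data is synthetic.

```python
import numpy as np


def u_max(scores):
    """Reference-free confidence: the highest relevance score of the instance."""
    return float(np.max(scores))


def u_std(scores):
    """Reference-free confidence: the spread (standard deviation) of the scores."""
    return float(np.std(scores))


def fit_u_lin(ref_scores, ref_quality):
    """Fit a linear confidence function on a reference set.

    ref_scores:  (n, k) array, relevance scores of k candidates for n queries.
    ref_quality: (n,) array, instance-wise metric value (e.g. average precision).
    Scores are sorted in decreasing order (an assumption of this sketch) so that
    each regression feature corresponds to a rank position.
    """
    features = np.sort(np.asarray(ref_scores, dtype=float), axis=1)[:, ::-1]
    features = np.hstack([features, np.ones((features.shape[0], 1))])  # intercept
    coef, *_ = np.linalg.lstsq(features, np.asarray(ref_quality, dtype=float), rcond=None)
    return coef


def u_lin(scores, coef):
    """Data-driven confidence: predicted ranking quality for one instance."""
    x = np.append(np.sort(np.asarray(scores, dtype=float))[::-1], 1.0)
    return float(x @ coef)


def should_abstain(confidence, threshold):
    """Threshold-based decision: abstain when confidence falls below the threshold."""
    return bool(confidence < threshold)


# Usage on synthetic data (values are illustrative only).
rng = np.random.default_rng(0)
ref_scores = rng.normal(size=(200, 10))   # black-box reranker scores
ref_quality = rng.uniform(size=200)       # stand-in for instance-wise metric values
coef = fit_u_lin(ref_scores, ref_quality)

# One hypothetical way to set the threshold: abstain on roughly the 20% of
# reference instances with the lowest confidence.
ref_conf = np.array([u_lin(s, coef) for s in ref_scores])
threshold = float(np.quantile(ref_conf, 0.2))

new_scores = rng.normal(size=10)
decision = should_abstain(u_lin(new_scores, coef), threshold)
```

Only the relevance scores are used as input, which matches the black-box setting: no access to the reranker's weights or to the query and document text is needed.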

Abstract

Neural Information Retrieval (NIR) has significantly improved upon heuristic-based Information Retrieval (IR) systems. Yet failures remain frequent: the models often fail to retrieve documents relevant to the user's query. We address this challenge by proposing a lightweight abstention mechanism tailored for real-world constraints, with particular emphasis on the reranking phase. We introduce a protocol for evaluating abstention strategies in black-box scenarios (typically encountered when relying on API services), demonstrate the efficacy of these strategies, and propose a simple yet effective data-driven mechanism. We provide open-source code for experiment replication and abstention implementation, fostering wider adoption and application in diverse contexts.

Paper Structure (31 sections, 10 equations, 9 figures, 12 tables)

Introduction
Problem Statement & Related Work
Notations
Abstention in Reranking
Related Work
Confidence Assessment for Document Reranking
Reference-Free Scenario
Data-Driven Scenario
Experimental Setup
Models and Datasets
Instance-Wise Metrics
Assessing Abstention Performance
Results
Abstention Performance
Abstention Effectiveness vs. Raw Model Performance
...and 16 more sections

Figures (9)

Figure 1: Procedure diagram for black-box confidence estimation and abstention decision in a reranking setting. In the reference-free scenario, confidence function $u$ is a simple heuristic (e.g., maximum). In the data-driven scenario, $u$ is a light non-trivial function of the relevance scores (e.g., learned linear combination).
Figure 1: Abstention performance. nAUCs in % averaged model-wise, for each method, dataset and metric.
Figure 3: Abstention performance (nAUC in %) vs. no-abstention performance (mAP), for all models and datasets, using $u_{\text{lin}}$ as a confidence function. Each data point represents a model-dataset pair.
Figure 3: MAEs for target vs. achieved abstention rates and performance levels (in %) (model: ember-v1, dataset: StackOverflow, metric: mAP).
Figure 4: nAUC vs. reference set size for $u_{\text{lin}}$ and $u_{\text{std}}$ on all datasets (model: ember-v1, metric: mAP).
...and 4 more figures
