Optimizing What Matters: AUC-Driven Learning for Robust Neural Retrieval

Nima Sheikholeslami, Erfan Hosseini, Patrice Bechard, Srivatsava Daruru, Sai Rajeswar

TL;DR

The paper identifies a fundamental misalignment in dense retriever training: contrastive losses such as InfoNCE do not enforce globally calibrated scores suitable for thresholding in retrieval-augmented tasks. It introduces the Mann–Whitney (MW) loss, which directly maximizes $AUC$ by minimizing binary cross-entropy over pairwise score differences, and proves an upper bound on $AoC$ (the area over the ROC curve, $1 - AUC$) in terms of $\mathcal{L}_{MW}$. The approach is validated on in-distribution and out-of-distribution benchmarks, where it improves $AUC$ and standard retrieval metrics and generalizes better to unseen domains, at the cost of slower convergence due to the harder objective. These results motivate calibration-aware learning for dense retrieval and suggest that $AUC$-aligned objectives can make RAG systems more reliable and effective in high-stakes settings.

Abstract

Dual-encoder retrievers depend on the principle that relevant documents should score higher than irrelevant ones for a given query. Yet the dominant Noise Contrastive Estimation (NCE) objective, which underpins Contrastive Loss, optimizes a softened ranking surrogate that we rigorously prove is fundamentally oblivious to score separation quality and unrelated to AUC. This mismatch leads to poor calibration and suboptimal performance in downstream tasks like retrieval-augmented generation (RAG). To address this fundamental limitation, we introduce the MW loss, a new training objective that maximizes the Mann-Whitney U statistic, which is mathematically equivalent to the Area under the ROC Curve (AUC). MW loss encourages each positive-negative pair to be correctly ranked by minimizing binary cross-entropy over score differences. We provide theoretical guarantees that MW loss directly upper-bounds the AoC, better aligning optimization with retrieval goals. We further promote ROC curves and AUC as natural threshold-free diagnostics for evaluating retriever calibration and ranking quality. Empirically, retrievers trained with MW loss consistently outperform contrastive counterparts in AUC and standard retrieval metrics. Our experiments show that MW loss is an empirically superior alternative to Contrastive Loss, yielding better-calibrated and more discriminative retrievers for high-stakes applications like RAG.
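
To make the objective described in the abstract concrete, here is a minimal sketch in plain Python of a pairwise binary cross-entropy over score differences, alongside the empirical AUC computed as a normalized Mann-Whitney U statistic. The function names (`mw_loss`, `auc`, `softplus`) and the scores are illustrative assumptions, not the authors' implementation. The final lines check the elementary bound $1 - AUC \le \mathcal{L}_{MW} / \ln 2$, which holds because every mis-ranked pair contributes at least $\ln 2$ to the loss; the constant in the paper's own lemma may differ.

```python
import math
from itertools import product


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mw_loss(pos_scores, neg_scores):
    """Mean BCE (target = 1) on sigmoid(s_pos - s_neg) over all pos/neg pairs."""
    pairs = list(product(pos_scores, neg_scores))
    # -log(sigmoid(d)) equals softplus(-d), with d = s_pos - s_neg
    return sum(softplus(-(sp - sn)) for sp, sn in pairs) / len(pairs)


def auc(pos_scores, neg_scores):
    """Empirical AUC: normalized Mann-Whitney U statistic (ties count 1/2)."""
    pairs = list(product(pos_scores, neg_scores))
    u = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0 for sp, sn in pairs)
    return u / len(pairs)


# Made-up similarity scores, for illustration only.
pos = [2.1, 0.3, 1.7]        # relevant passages
neg = [0.5, -1.0, 1.9, 0.0]  # irrelevant passages

loss = mw_loss(pos, neg)
aoc = 1.0 - auc(pos, neg)
print(f"MW loss = {loss:.4f}, AUC = {1.0 - aoc:.4f}, AoC = {aoc:.4f}")
print("AoC <= loss / ln 2:", aoc <= loss / math.log(2))
```

Because the loss averages over every positive-negative pair rather than normalizing within a single query, pushing it down forces positives above negatives on one shared score scale, which is what a fixed threshold needs.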

Paper Structure

This paper contains 17 sections, 3 theorems, 18 equations, 4 figures, 5 tables.

Key Result

Lemma 1

Let us define $\ell_\tau$ as a function of $s^{+}$ and $S^{-}$, where $s^{+}$ is a positive score and $S^{-}$ is a set of negative scores. With the notation of the InfoNCE equation (eq:infonce), we set $s^{+} = s(q, p^{+})$ and $S^{-}=\{s(q,p^{-})\mid p^{-} \in \{p^{-}_k\}_{k=1}^{K} \}$. Substituting these into the definition of $\ell_\tau$, the population loss can be rewritten …
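
The shift-invariance named in Lemma 1 can be illustrated numerically. The sketch below assumes the standard InfoNCE form (negative log-softmax of the positive score at temperature `tau`); the paper's exact definition of $\ell_\tau$ is not reproduced here, and the queries, scores, and helper names are hypothetical. Adding a per-query constant to every score leaves each query's InfoNCE loss unchanged, yet the AUC computed over scores pooled across queries can drop.

```python
import math


def info_nce(s_pos, s_negs, tau=1.0):
    """Standard InfoNCE for one query: -log softmax of the positive score."""
    logits = [s_pos / tau] + [s / tau for s in s_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]


def global_auc(pos, neg):
    """Fraction of pooled (positive, negative) pairs ranked correctly."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pooled(queries):
    """Pool positive and negative scores across all queries."""
    pos = [sp for sp, _ in queries]
    neg = [s for _, sn in queries for s in sn]
    return pos, neg


# Two hypothetical queries: (positive score, negative scores).
queries = [(2.0, [0.0, -1.0]), (1.5, [0.5, -0.5])]
shifts = [0.0, -5.0]  # per-query constant added to every score of that query

shifted = [(sp + c, [s + c for s in sn]) for (sp, sn), c in zip(queries, shifts)]

for (sp, sn), (tp, tn) in zip(queries, shifted):
    print(f"InfoNCE before = {info_nce(sp, sn):.6f}, after = {info_nce(tp, tn):.6f}")

print("pooled AUC before shift:", global_auc(*pooled(queries)))
print("pooled AUC after shift: ", global_auc(*pooled(shifted)))
```

Within each query the ranking stays perfect after the shift, so per-query metrics would not flag the problem; only a pooled, threshold-oriented view such as AUC exposes it.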

Figures (4)

  • Figure 1: Histograms of positive and negative scores from models trained on the NLI dataset with Contrastive Loss and MW loss. The model trained with MW loss separates the two score distributions better, and its ROC curve dominates that of the model trained with Contrastive Loss everywhere.
  • Figure 2: Visual comparison of Contrastive Loss vs. MW Loss. The MW Loss performs more pairwise comparisons without increasing the embedding or similarity computation cost. In both subfigures, each square in the colored matrix represents a similarity computation between a query and a passage (green for a positive pair, red for a negative pair). Similarity scores are aggregated differently for each loss function, converging to a grey square above. The grey squares are then summed to obtain the final loss.
  • Figure 3: Examples where irrelevant passages receive similar scores to relevant ones, making threshold-based filtering unreliable. Relevant passages are in bold.
  • Figure 4: Performance gain from using MW over CL across metrics and models. The plots show AUC gain (left), MRR gain (center), and nDCG gain (right). Positive values indicate that MW outperforms CL.

Theorems & Definitions (5)

  • Lemma 1: Shift-invariance & unconstrained AoC for Contrastive Loss
  • Lemma 2: MW upper-bounds AoC
  • proof
  • Lemma: restatement of Lemma 1 (shift-invariance)
  • proof