Are LLM-Based Retrievers Worth Their Cost? An Empirical Study of Efficiency, Robustness, and Reasoning Overhead

Abdelrahman Abdallah, Jamie Holdcroft, Mohammed Ali, Adam Jatowt

Abstract

Large language model retrievers improve performance on complex queries, but their practical value depends on efficiency, robustness, and reliable confidence signals in addition to accuracy. We reproduce a reasoning-intensive retrieval benchmark (BRIGHT) across 12 tasks and 14 retrievers, and extend evaluation with cold-start indexing cost, query latency distributions and throughput, corpus scaling, robustness to controlled query perturbations, and confidence use (AUROC) for predicting query success. We also quantify reasoning overhead by comparing standard queries to five provided reasoning-augmented variants, measuring accuracy gains relative to added latency. We find that some reasoning-specialized retrievers achieve strong effectiveness while remaining competitive in throughput, whereas several large LLM-based bi-encoders incur substantial latency for modest gains. Reasoning augmentation incurs minimal latency for sub-1B encoders but exhibits diminishing returns for top retrievers and may reduce performance on formal math/code domains. Confidence calibration is consistently weak across model families, indicating that raw retrieval scores are unreliable for downstream routing without additional calibration. We release all code and artifacts for reproducibility.
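As a concrete illustration of the evaluation described in the abstract, the sketch below shows how reasoning overhead (nDCG@10 gain relative to added latency) and confidence-based success prediction (AUROC over raw retrieval scores) could be computed from per-query results. It is a minimal sketch under stated assumptions, not the paper's released code: the per-query record fields, the use of the top-1 similarity score as "confidence", and the success threshold of 0.5 are all illustrative choices.

```python
# Minimal sketch (illustrative; not the paper's released code).
# Assumes per-query records with nDCG@10, end-to-end latency, and a top-1 retrieval score.
from dataclasses import dataclass
from typing import List
from sklearn.metrics import roc_auc_score  # AUROC of confidence vs. query success

@dataclass
class QueryResult:
    ndcg_at_10: float   # effectiveness for this query
    latency_s: float    # end-to-end retrieval latency in seconds
    top1_score: float   # retriever's highest similarity score, used as "confidence"

def reasoning_overhead(standard: List[QueryResult],
                       reasoned: List[QueryResult]) -> dict:
    """Accuracy gain of reasoning-augmented queries relative to added latency."""
    d_ndcg = (sum(r.ndcg_at_10 for r in reasoned) / len(reasoned)
              - sum(r.ndcg_at_10 for r in standard) / len(standard))
    d_lat = (sum(r.latency_s for r in reasoned) / len(reasoned)
             - sum(r.latency_s for r in standard) / len(standard))
    return {"ndcg_gain": d_ndcg,
            "latency_penalty_s": d_lat,
            "gain_per_second": d_ndcg / d_lat if d_lat > 0 else float("inf")}

def confidence_auroc(results: List[QueryResult],
                     success_threshold: float = 0.5) -> float:
    """AUROC of the raw retrieval score for predicting per-query success,
    where success is defined (here, as an assumption) as nDCG@10 >= threshold."""
    labels = [int(r.ndcg_at_10 >= success_threshold) for r in results]
    scores = [r.top1_score for r in results]
    return roc_auc_score(labels, scores)
```

An AUROC near 0.5 under this kind of analysis would mean the raw score carries little signal about whether a query succeeded, which is the weak-calibration pattern the abstract reports.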

Paper Structure

This paper contains 28 sections, 4 equations, 5 figures, and 9 tables.

Figures (5)

  • Figure 1: Pareto frontier: nDCG@10 vs. throughput (QPS). Points above and to the right dominate; dashed line connects Pareto-optimal models.
  • Figure 2: nDCG@10 gain vs. latency penalty per model. Upper-left quadrant (high gain, low cost) is ideal.
  • Figure 3: Task-level nDCG@10 gain (averaged across retrievers).
  • Figure 4: Robustness under query perturbations, shown as retention ratio: $\text{nDCG@10}_{\text{perturbed}} / \text{nDCG@10}_{\text{original}}$. Values above 1.0 indicate improvement; below 1.0 indicate degradation.
  • Figure 5: Hybrid fusion of BM25 with seven dense retrievers (12-task average nDCG@10). Three strategies shown: Reciprocal Rank Fusion (RRF), Linear score combination ($\alpha{=}0.5$), and Dynamic Weight Adaptation (DAT).
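As context for Figure 5, the following is a minimal sketch of the two simpler fusion strategies, Reciprocal Rank Fusion and linear score combination with $\alpha{=}0.5$. The DAT variant is omitted, and the $k{=}60$ RRF constant and min-max score normalization are common defaults assumed here, not settings confirmed by the paper.

```python
# Illustrative sketch of two fusion strategies from Figure 5 (RRF and linear
# combination); k=60 and min-max normalization are assumed defaults.
from typing import Dict

def rrf_fuse(bm25_rank: Dict[str, int], dense_rank: Dict[str, int],
             k: int = 60) -> Dict[str, float]:
    """Reciprocal Rank Fusion: score(d) = sum over systems of 1 / (k + rank_d),
    where rank_d is the (1-based) rank of document d in that system."""
    docs = set(bm25_rank) | set(dense_rank)
    return {d: sum(1.0 / (k + ranks[d]) for ranks in (bm25_rank, dense_rank) if d in ranks)
            for d in docs}

def linear_fuse(bm25_score: Dict[str, float], dense_score: Dict[str, float],
                alpha: float = 0.5) -> Dict[str, float]:
    """Linear combination of min-max-normalized scores:
    alpha * dense + (1 - alpha) * bm25."""
    def norm(scores: Dict[str, float]) -> Dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        return {d: (v - lo) / (hi - lo) if hi > lo else 0.0 for d, v in scores.items()}
    b, d = norm(bm25_score), norm(dense_score)
    docs = set(b) | set(d)
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0) for doc in docs}
```

Documents missing from one system's ranking simply receive no contribution from it, which is the standard behavior for both fusion rules.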