Unraveling and Mitigating Retriever Inconsistencies in Retrieval-Augmented Large Language Models

Mingda Li; Xinyu Li; Yifan Chen; Wenfeng Xuan; Weinan Zhang

TL;DR

The paper addresses why Retrieval-Augmented LLMs show inconsistent, example-level performance across retriever choices. It introduces a theoretical error decomposition into $E_r$, $E_h$, $E_e$, and $E_{luck}$, and demonstrates that both knowledge-source differences and unpredictable reader degeneration drive this instability. To mitigate it, the authors propose Ensemble of Retrievers (EoR), a trainable, voting-based framework that samples from multiple retrievers and uses similarity-based scoring to select the best answer without retraining the LLMs. Empirical results on open-domain QA across multiple datasets and base models show that EoR reduces inconsistent behaviors (lower MRLR) and improves corpus-level accuracy compared with single-retriever RALMs. This framework offers a practical, model-agnostic approach to bolstering the robustness and reliability of retrieval-augmented systems in real-world deployments.

Abstract

Although Retrieval-Augmented Large Language Models (RALMs) demonstrate their superiority in terms of factuality, they do not consistently outperform the original retrieval-free Language Models (LMs). Our experiments reveal that this example-level performance inconsistency exists not only between retrieval-augmented and retrieval-free LM but also among different retrievers. To understand this phenomenon, we investigate the degeneration behavior of RALMs and theoretically decompose it into four categories. Further analysis based on our decomposition reveals that the innate difference in knowledge sources and the unpredictable degeneration of the reader model contribute most to the inconsistency. Drawing from our analysis, we introduce Ensemble of Retrievers (EoR), a trainable framework that can adaptively retrieve from different knowledge sources and effectively decrease unpredictable reader errors. Our experiments on Open Domain Question Answering show that EoR substantially improves performance over the RALM with a single retriever by considerably reducing inconsistent behaviors.
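The voting idea behind EoR can be illustrated with a minimal sketch. This is not the authors' implementation: the token-overlap similarity, the per-retriever `weights`, and the names `token_jaccard` and `vote` are all assumptions made for illustration. In a trainable variant the weights would be learned; here they are hand-set placeholders.

```python
from typing import Callable, Dict, Optional, Tuple


def token_jaccard(a: str, b: str) -> float:
    """Assumed similarity: Jaccard overlap of lowercased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def vote(
    answers: Dict[str, str],
    weights: Optional[Dict[str, float]] = None,
    similarity: Callable[[str, str], float] = token_jaccard,
) -> Tuple[str, str]:
    """Pick the answer that agrees most with the answers from other retrievers.

    `answers` maps a (hypothetical) retriever name to the answer the LM
    produced with that retriever's documents. `weights` is a placeholder
    for per-retriever trust; uniform weights are used when it is omitted.
    """
    if not answers:
        raise ValueError("need at least one candidate answer")
    if weights is None:
        weights = {name: 1.0 for name in answers}

    scores: Dict[str, float] = {}
    for name, answer in answers.items():
        # Score = weighted similarity to every other retriever's answer.
        scores[name] = sum(
            weights[other] * similarity(answer, other_answer)
            for other, other_answer in answers.items()
            if other != name
        )

    # Ties resolve to the first retriever in insertion order.
    best = max(scores, key=scores.get)
    return best, answers[best]
```

For example, given `{"bm25": "Paris", "dense": "paris", "web": "Lyon"}`, the two agreeing answers support each other and the outlier scores zero, so one of the "Paris" answers is selected.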

Paper Structure (24 sections, 29 equations, 9 figures, 3 tables)

Introduction
Retrievers Are Inconsistent
Experimental Setup
Experimental Results
Why Does Retriever Inconsistency Happen?
Error Decomposition
Retriever Inconsistency Stems From Irregular Error Patterns
Ensemble of Retrievers
Our Method
Ensemble by Learning
Experimental Setting
Results and Analysis
Related Work
Conclusion
Acknowledgements
...and 9 more sections

Figures (9)

Figure 1: Retriever-to-Retriever Relative Win Ratio heatmap on Natural Questions with ChatGPT as the LM. Each cell gives the proportion of questions answered incorrectly by the column retriever that were answered correctly by the row retriever. A value of 0 means that every question answered correctly by the row retriever is also answered correctly by the column retriever, which implies that the column retriever consistently outperforms the row retriever. See Equation \ref{eq1} for the formal definition.
Figure 2: Boxplot of the distribution of MRLR for 15 different retrievers across different datasets and models.
Figure 3: Corpus-level performance of different retrievers, evaluated by BEM Accuracy on different datasets with ChatGPT as the base LM. Retrievers are ordered by their performance on NQ.
Figure 4: Error Relative Win Ratio between different retrievers with ChatGPT as the base LM, evaluated on the NQ validation set. A value of 0 means the column retriever consistently outperforms the row retriever with respect to error occurrence, and -1 means at least one of the retrievers is free of this error. Only part of the results is shown because of space limitations, but the findings are the same; see Appendix \ref{appendix: more figures} for more graphs.
Figure 5: The upper bound of BEM Accuracy by ensembling different retrievers on NQ with ChatGPT as base LM. Each boxplot represents the distribution of the upper bound for different retriever combinations with the same pool size. The dashed line shows the best single retriever performance.
...and 4 more figures
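The Relative Win Ratio described in the Figure 1 caption can be computed from per-question correctness flags. The sketch below follows the caption's wording only; the function names, the input layout, and the choice to return 0 when the column retriever makes no errors are assumptions, not details taken from the paper.

```python
from typing import Dict, List


def relative_win_ratio(row_correct: List[bool], col_correct: List[bool]) -> float:
    """Share of the column retriever's failures that the row retriever gets right."""
    if len(row_correct) != len(col_correct):
        raise ValueError("both retrievers must be scored on the same questions")
    # Indices of questions the column retriever answered incorrectly.
    col_wrong = [i for i, ok in enumerate(col_correct) if not ok]
    if not col_wrong:
        # Assumption: with no column failures there is nothing to win, so use 0.
        return 0.0
    wins = sum(1 for i in col_wrong if row_correct[i])
    return wins / len(col_wrong)


def rwr_matrix(results: Dict[str, List[bool]]) -> Dict[str, Dict[str, float]]:
    """Pairwise ratios as matrix[row][col] for every ordered pair of retrievers."""
    return {
        row: {
            col: relative_win_ratio(results[row], results[col])
            for col in results
            if col != row
        }
        for row in results
    }
```

A cell of 0 at `matrix[row][col]` then matches the caption's reading: the row retriever never succeeds where the column retriever fails.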
