Uncertainty Quantification for Retrieval-Augmented Reasoning
Heydar Soudani, Hamed Zamani, Faegheh Hasibi
TL;DR
This work tackles uncertainty quantification for retrieval-augmented reasoning (RAR), where uncertainty arises from both the retriever and the generator. It introduces Retrieval-Augmented Reasoning Consistency (R2C), an MDP-based framework that explores diverse reasoning paths by perturbing them with three actions (query paraphrasing, critical rethinking, and answer validation) and uses majority voting over the resulting answers to derive uncertainty scores. Across five RAR models and multiple QA datasets, R2C improves AUROC by over 5% on average over state-of-the-art baselines. It also performs well extrinsically on abstention and model selection, with ~5% improvements in the abstention metrics and ~7% gains in exact match for model selection. The results indicate that explicitly modeling uncertainty from both retrieval and generation, coupled with input-diversifying perturbations, improves reliability while maintaining or enhancing efficiency, with promising implications for broader deployment in knowledge-intensive NLP systems.
Abstract
Retrieval-augmented reasoning (RAR) is a recent evolution of retrieval-augmented generation (RAG) that employs multiple reasoning steps for retrieval and generation. While effective for some complex queries, RAR remains vulnerable to errors and misleading outputs. Uncertainty quantification (UQ) offers methods to estimate the confidence of systems' outputs. These methods, however, are typically designed for simple queries with no retrieval or single-step retrieval and do not properly handle the RAR setup. Accurate UQ for RAR requires accounting for all sources of uncertainty, including those arising from retrieval and generation. In this paper, we account for all these sources and introduce Retrieval-Augmented Reasoning Consistency (R2C), a novel UQ method for RAR. The core idea of R2C is to perturb the multi-step reasoning process by applying various actions to reasoning steps. These perturbations alter the retriever's input, which shifts its output and consequently modifies the generator's input at the next step. Through this iterative feedback loop, the retriever and generator continuously reshape one another's inputs, enabling us to capture uncertainty arising from both components. Experiments on five popular RAR systems across diverse QA datasets show that R2C improves AUROC by over 5% on average compared to state-of-the-art UQ baselines. Extrinsic evaluations using R2C as an external signal further confirm its effectiveness for two downstream tasks: in Abstention, it achieves ~5% gains in both F1Abstain and AccAbstain; in Model Selection, it improves exact match by ~7% over single models and ~3% over selection methods.
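The consistency idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a hypothetical callable `run_rar(question, action)` that reruns a RAR system with one perturbation applied and returns a final answer string. The action names, the uniform random choice of action, the lowercase normalization, the scoring rule (one minus the majority vote share), and the abstention threshold are all illustrative assumptions.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical labels for the three perturbation actions named in the text.
ACTIONS = ("query_paraphrasing", "critical_rethinking", "answer_validation")


def sample_perturbed_answers(
    run_rar: Callable[[str, str], str],
    question: str,
    num_paths: int = 10,
    seed: int = 0,
) -> List[str]:
    """Collect one final answer per perturbed reasoning path.

    `run_rar` is an assumed interface, not a real library call: it takes the
    question and an action name and returns the system's final answer.
    """
    rng = random.Random(seed)
    return [run_rar(question, rng.choice(ACTIONS)) for _ in range(num_paths)]


def majority_vote_uncertainty(answers: List[str]) -> Tuple[str, float]:
    """Return the majority answer and 1 - (its share of the votes)."""
    if not answers:
        raise ValueError("need at least one answer")
    # Simple normalization so that "Paris" and "paris " count as one answer.
    normalized = [a.strip().lower() for a in answers]
    top_answer, votes = Counter(normalized).most_common(1)[0]
    return top_answer, 1.0 - votes / len(normalized)


def should_abstain(uncertainty: float, threshold: float = 0.5) -> bool:
    """Abstain when uncertainty exceeds an (assumed) threshold."""
    return uncertainty > threshold


def toy_rar(question: str, action: str) -> str:
    """Stand-in for a real RAR system, used only to make the sketch runnable."""
    return "Lyon" if action == "critical_rethinking" else "Paris"


if __name__ == "__main__":
    paths = sample_perturbed_answers(toy_rar, "What is the capital of France?")
    answer, score = majority_vote_uncertainty(paths)
    print(answer, round(score, 2), should_abstain(score))
```

In this sketch, agreement across perturbed paths yields a low score and disagreement a high one, which is the kind of signal that the abstention and model-selection tasks consume.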
