TRAQ: Trustworthy Retrieval Augmented Question Answering via Conformal Prediction

Shuo Li; Sangdon Park; Insup Lee; Osbert Bastani

TL;DR

TRAQ addresses hallucinations in open-domain QA by pairing retrieval-augmented generation with conformal prediction to yield end-to-end probabilistic guarantees. It builds separate conformal prediction sets for the retriever and for the LLM, then aggregates them into a final answer set that contains a semantically correct answer with probability at least $1-\alpha$, where $\alpha = \alpha_{Ret}+\alpha_{LLM}$. A novel semantic-clustering nonconformity measure makes the uncertainty estimates robust to paraphrased answers and works with black-box LLMs; Bayesian optimization reduces the average prediction-set size without compromising coverage. Experiments on Natural Questions, TriviaQA, SQuAD-1, and BioASQ show that TRAQ achieves the desired coverage and reduces set size by about $16.2\%$ on average compared to an ablation. The result is a principled approach to uncertainty in open-domain QA with provable guarantees that also applies to API-based LLMs.
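One standard way to see why the two error budgets add is a union bound. The sketch below is not taken from the paper's proof; the event names $E_{Ret}$ and $E_{LLM}$ are introduced here for illustration. Assume $E_{Ret}$ is the event that the retriever set misses the relevant passage, with $\Pr(E_{Ret}) \le \alpha_{Ret}$, and $E_{LLM}$ is the event that the LLM set built on the relevant passage misses every semantically correct answer, with $\Pr(E_{LLM}) \le \alpha_{LLM}$. If neither event occurs, the aggregated set contains a correct answer, so

$$
\Pr(\text{correct answer in aggregated set}) \;\ge\; 1 - \Pr(E_{Ret} \cup E_{LLM}) \;\ge\; 1 - \Pr(E_{Ret}) - \Pr(E_{LLM}) \;\ge\; 1 - \alpha_{Ret} - \alpha_{LLM} \;=\; 1 - \alpha .
$$

For example, under these assumptions $\alpha_{Ret} = 0.05$ and $\alpha_{LLM} = 0.05$ give an end-to-end coverage of at least $0.9$.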

Abstract

When applied to open-domain question answering, large language models (LLMs) frequently generate incorrect responses based on made-up facts, which are called hallucinations. Retrieval augmented generation (RAG) is a promising strategy to avoid hallucinations, but it does not provide guarantees on its correctness. To address this challenge, we propose Trustworthy Retrieval Augmented Question Answering, or TRAQ, which provides the first end-to-end statistical correctness guarantee for RAG. TRAQ uses conformal prediction, a statistical technique for constructing prediction sets that are guaranteed to contain the semantically correct response with high probability. Additionally, TRAQ leverages Bayesian optimization to minimize the size of the constructed sets. In an extensive experimental evaluation, we demonstrate that TRAQ provides the desired correctness guarantee while reducing prediction set size by 16.2% on average compared to an ablation. The implementation is available at https://github.com/shuoli90/TRAQ.git.

Paper Structure (50 sections, 11 theorems, 30 equations, 17 figures, 4 tables, 3 algorithms)

Introduction
Contributions.
Background
Retrieval for Open-Domain QA.
Conformal Prediction.
Uncertainty Quantification for LLMs.
The TRAQ Framework
Assumptions
Prediction Set Construction
Retriever Set:
LLM Set:
Aggregated Set:
Performance Improvement
Experiments
Experiment Setup.
...and 35 more sections

Key Result

Theorem 1

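Conformal Prediction Guarantee [angelopoulos2022gentle; shafer2007tutorialcp]. Suppose that $\{(x_i, y_i)\}_{i=1}^N$ and $(X_{\text{test}},Y_{\text{test}})$ are i.i.d. from $\mathcal{D}$, and $C(X_{\text{test}})$ is the prediction set constructed by the conformal prediction rule (eq:cp) at error level $\alpha$; then we have $\Pr\left(Y_{\text{test}} \in C(X_{\text{test}})\right) \ge 1-\alpha$.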
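A prediction set of the kind this theorem covers can be illustrated with the standard split conformal recipe. The following is a minimal sketch under stated assumptions, not the paper's implementation: the function names `conformal_threshold` and `prediction_set` are hypothetical, and the nonconformity scores are assumed to be given (lower means the candidate conforms better).

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((N + 1) * (1 - alpha))-th smallest calibration score.

    cal_scores: nonconformity scores of the true labels on N held-out
    calibration examples (assumed i.i.d. with the test example).
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial
        # set that keeps every candidate gives the guarantee.
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is at most the threshold."""
    return {label for label, score in candidate_scores.items() if score <= threshold}


# Toy usage with made-up scores.
cal = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
tau = conformal_threshold(cal, alpha=0.2)
answers = prediction_set({"a": 0.3, "b": 0.85, "c": 0.8}, tau)
```

With nine calibration scores and `alpha=0.2`, the rank is `ceil(10 * 0.8) = 8`, so the threshold is the eighth smallest score. A smaller `alpha` demands a higher rank, and once the rank exceeds the number of calibration points the sketch falls back to keeping every candidate.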

Figures (17)

Figure 1: Comparison of the standard RAG pipeline with TRAQ on a practical example. With standard retrieval augmented generation (RAG), the retrieved passage may be irrelevant to the given question. In contrast, TRAQ uses conformal prediction to ensure that the retriever set includes the relevant passage with high probability and that the LLM set contains a semantically correct answer with high probability. By aggregating these prediction sets, TRAQ guarantees that a semantically correct answer is contained in its set of answers with high probability.
Figure 2: Given a question, TRAQ first constructs the retriever prediction set; then, for every (question, contained passage) pair, TRAQ constructs an LLM prediction set over the LLM-generated responses. Finally, the LLM prediction sets are aggregated into the final output. In the optimization figure (fig:optimization), TRAQ takes candidate error budgets from Bayesian optimization and constructs aggregated prediction sets on the optimization set. The average number of semantic meanings in the constructed sets is then computed and used to update the Gaussian process model in Bayesian optimization.
Figure 3: Retriever and generator coverage rates on the BioASQ dataset.
Figure 4: End-to-end guarantee considering only the most relevant passage on BioASQ Dataset.
Figure 5: End-to-end coverage guarantee considering all passages on the BioASQ dataset.
...and 12 more figures
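The pipeline described in the Figure 2 caption (retriever set first, then one LLM set per kept passage, then aggregation) can be sketched as follows. This is an illustrative sketch only: the function `aggregate_answers`, the callables `retriever_score`, `sample_answers`, and `answer_score`, and the thresholds `tau_ret` and `tau_llm` are all hypothetical names, the thresholds are assumed to come from separate conformal calibration steps, and answers are compared as plain strings rather than by semantic meaning.

```python
def aggregate_answers(question, passages, retriever_score, sample_answers,
                      answer_score, tau_ret, tau_llm):
    """Build an aggregated answer set from a retriever set and per-passage LLM sets.

    All scores are nonconformity scores: lower means more plausible.
    """
    # Retriever set: passages whose score clears the retriever threshold.
    kept_passages = [p for p in passages if retriever_score(question, p) <= tau_ret]

    final_set = set()
    for passage in kept_passages:
        # LLM set for this (question, passage) pair.
        for answer in sample_answers(question, passage):
            if answer_score(question, passage, answer) <= tau_llm:
                final_set.add(answer)
    # The aggregated set is the union of the per-passage LLM sets.
    return final_set


# Toy usage with hand-made scores in place of a real retriever and LLM.
toy_ret = {"p1": 0.1, "p2": 0.4, "p3": 0.9}
toy_gen = {"p1": ["Paris", "Marseille"], "p2": ["Lyon"], "p3": ["Rome"]}
result = aggregate_answers(
    "What is the capital of France?",
    ["p1", "p2", "p3"],
    retriever_score=lambda q, p: toy_ret[p],
    sample_answers=lambda q, p: toy_gen[p],
    answer_score=lambda q, p, a: 0.2 if a in ("Paris", "Lyon") else 0.8,
    tau_ret=0.5,
    tau_llm=0.5,
)
```

Loosening `tau_ret` admits more passages and loosening `tau_llm` admits more answers per passage, so the two thresholds trade off against each other in the final set size; that trade-off is what the error-budget search in the caption is tuning.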

Theorems & Definitions (16)

Theorem 1
Theorem 2
Lemma 2.1
Lemma 2.2
Theorem 3
Theorem 4: [vovk2012conditional; park2021pac]
Proof of Lemma (co:ret)
Proof of Lemma (co:chat)
Proof of Theorem (th:e2e)
Lemma 4.1
...and 6 more
