Is Conformal Factuality for RAG-based LLMs Robust? Novel Metrics and Systematic Insights

Yi Chen; Daiwei Chen; Sukrut Madhav Chikodikar; Caitlyn Heqi Yin; Ramya Korlakai Vinayak

Abstract

Large language models (LLMs) frequently hallucinate, limiting their reliability in knowledge-intensive applications. Retrieval-augmented generation (RAG) and conformal factuality have emerged as potential ways to address this limitation. While RAG aims to ground responses in retrieved evidence, it provides no statistical guarantee that the final output is correct. Conformal factuality filtering offers distribution-free statistical reliability by scoring and filtering atomic claims using a threshold calibrated on held-out data; however, the informativeness of the final output is not guaranteed. We systematically analyze the reliability and usefulness of conformal factuality for RAG-based LLMs across generation, scoring, calibration, robustness, and efficiency. We propose novel informativeness-aware metrics that better reflect task utility under conformal filtering. Across three benchmarks and multiple model families, we find that (i) conformal filtering suffers from low usefulness at high factuality levels due to vacuous outputs, (ii) the conformal factuality guarantee is not robust to distribution shifts and distractors, which highlights a key limitation: calibration data must closely match deployment conditions, and (iii) lightweight entailment-based verifiers match or outperform LLM-based model confidence scorers while requiring over $100\times$ fewer FLOPs. Overall, our results expose factuality-informativeness trade-offs and the fragility of the conformal filtering framework under distribution shifts and distractors, highlight the need for new approaches to reliability that treat robustness and usefulness as key metrics, and provide actionable guidance for building RAG pipelines that are both reliable and computationally efficient.
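To make the calibrate-then-filter step concrete, the following is a minimal sketch of split-conformal claim filtering. It is an illustration under stated assumptions, not the authors' implementation: the function names (`example_cutoff`, `calibrate`, `filter_claims`), the input layout (per-claim scores paired with correctness labels), and the strict `>` retention rule are all hypothetical choices. The per-example cutoff is taken to be the highest score assigned to an incorrect claim, and the threshold is the usual $\lceil (n+1)(1-\alpha) \rceil$-th smallest cutoff over $n$ calibration examples.

```python
import math

def example_cutoff(scores, labels):
    """Highest score given to an incorrect claim in one calibration example.

    Every claim scoring strictly above this value is correct. If the example
    has no incorrect claims, any cutoff works, so return -inf.
    """
    wrong = [s for s, ok in zip(scores, labels) if not ok]
    return max(wrong) if wrong else float("-inf")

def calibrate(calibration, alpha):
    """Pick a threshold from held-out data (assumed layout: list of
    (scores, labels) pairs, one pair per calibration response)."""
    cutoffs = sorted(example_cutoff(s, l) for s, l in calibration)
    n = len(cutoffs)
    k = math.ceil((n + 1) * (1 - alpha))  # conformal quantile index (1-based)
    if k > n:
        # Too few calibration examples for this alpha: filter everything.
        return float("inf")
    return cutoffs[k - 1]

def filter_claims(claims, scores, tau):
    """Keep only claims whose score is strictly above the threshold."""
    return [c for c, s in zip(claims, scores) if s > tau]

# Toy calibration set: three responses with per-claim scores and labels.
calibration = [
    ([0.9, 0.3], [True, False]),
    ([0.8, 0.6], [True, False]),
    ([0.7], [True]),
]
tau = calibrate(calibration, alpha=0.5)
kept = filter_claims(["a", "b", "c"], [0.9, 0.3, 0.5], tau)
```

In this sketch a stricter target level pushes the threshold up, and once it exceeds every claim's score the filtered output is empty. That is the kind of vacuous output the abstract points to when it notes that informativeness is not guaranteed.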

Paper Structure (73 sections, 2 equations, 52 figures, 3 tables)

Introduction
Problem Setting, Datasets, Models, and Metrics
Scoring Functions
Datasets
Language Models
Evaluation Metrics
Impact of References
Design Choices for Factuality Scoring in Conformal Filtering
Prompting Strategies for LLM Model Confidence Scores
Role of References on Model Confidence Scores
Model Choice for LLM-based Scorers
Comparison between Entailment-based and LLM-based Scoring Functions
Conditional Sufficient Correctness of the Filtered Output
Robustness of Factuality Scoring Functions
Robustness to Calibration Distribution Shift
...and 58 more sections

Figures (52)

Figure 1: Overview of our framework. Given a query $x$ and retrieved references $R(x)$, the Response Generator $G$ produces an output $y$. The conformal factuality framework uses a separate calibration dataset to determine a threshold, which is then used to filter information out of $y$ and yield $y'$. (See Figure \ref{diag:pipeline} for details of the stages involved in conformal filtering.)
Figure 2: Given an input $x$ and a reference text related to $x$, the Response Generator produces an output $y$, which is then parsed by the Parser into a list of claims. Each individual claim is subsequently scored by the Scorer, conditioned on the input $x$ and, optionally, the reference text. These scores are passed to the conformal prediction algorithm, which filters out claims whose scores fall below a learned threshold. Finally, the remaining claims are merged into a single paragraph and returned to the user.
Figure 3: Sufficient correctness (SC) of Qwen3 models (0.6B, 4B, 8B) on four datasets (MATH-200, FActScore-Rare, FActScore, NQ-200), with and without access to references. Across model sizes and datasets, providing references consistently improves generation quality.
Figure 4: Evaluation of various prompting strategies across different LLMs on the FActScore dataset (Section \ref{sec:prompting_strategy}) at level $1-\alpha = 0.9$. Results demonstrate that (i) prompting models to generate numeric scores consistently outperforms Boolean scoring; (ii) sampling multiple responses uniformly improves performance; however, (iii) incorporating chain-of-thought reasoning or evidence highlighting does not yield reliable performance gains across models.
Figure 5: Performance of model confidence scores on the MATH-1K dataset, with and without references provided to the scoring function, using Qwen3-4B as the LLM-based scorer.
...and 47 more figures
