Improving Consistency in Retrieval-Augmented Systems with Group Similarity Rewards

Faisal Hamman, Chenyang Zhu, Anoop Kumar, Xujun Peng, Sanghamitra Dutta, Daben Liu, Alfy Samuel

TL;DR

This work addresses information consistency in retrieval-augmented generation (RAG) systems, where outputs should convey the same core content for semantically equivalent queries. It introduces a principled evaluation framework that decomposes consistency into retriever-level, generator-level, and end-to-end components. To improve consistency, it proposes Paraphrased Set Group Relative Policy Optimization (PS-GRPO), a reinforcement learning approach that uses group similarity rewards across paraphrase sets to train the generator (Con-RAG), with a scalable approximation to make training feasible. Empirical results across short-form, multi-hop, and long-form QA on diverse model families show that Con-RAG improves both consistency and accuracy without explicit ground-truth supervision, offering practical guidance for reliable RAG deployments in safety-critical settings.

Abstract

RAG systems are increasingly deployed in high-stakes domains where users expect outputs to be consistent across semantically equivalent queries. However, existing systems often exhibit significant inconsistencies due to variability in both the retriever and generator (LLM), undermining trust and reliability. In this work, we focus on information consistency, i.e., the requirement that outputs convey the same core content across semantically equivalent inputs. We introduce a principled evaluation framework that decomposes RAG consistency into retriever-level, generator-level, and end-to-end components, helping identify the sources of inconsistency. To improve consistency, we propose Paraphrased Set Group Relative Policy Optimization (PS-GRPO), an RL approach that leverages multiple rollouts across a paraphrase set to assign group similarity rewards. We use PS-GRPO to achieve Information Consistent RAG (Con-RAG), training the generator to produce consistent outputs across paraphrased queries and to remain robust to retrieval-induced variability. Because exact reward computation over paraphrase sets is computationally expensive, we also introduce a scalable approximation method that retains effectiveness while enabling efficient, large-scale training. Empirical evaluations across short-form, multi-hop, and long-form QA benchmarks demonstrate that Con-RAG significantly improves both consistency and accuracy over strong baselines, even in the absence of explicit ground-truth supervision. Our work provides practical solutions for evaluating and building reliable RAG systems for safety-critical deployments.

Paper Structure

This paper contains 10 sections, 4 equations, 3 figures, 11 tables.

Figures (3)

  • Figure 1: Motivational Example. Two semantically equivalent queries lead to different outputs from a RAG system, despite both responses being factually correct. Such variation may be acceptable in many applications, but in certain high-stakes domains (e.g., healthcare, finance, legal) information consistency across semantically equivalent inputs may be required to ensure reliability, user trust, and compliance.
  • Figure 2: Comparison between Con-RAG and baselines across accuracy and consistency dimensions on LLaMA-3.1-8B and Qwen-2.5-3B. Each plot summarizes performance on a single dataset using accuracy measures (Exact Match, token F1, Relaxed Match) and end-to-end information consistency (measured lexically and via LLM-judge). Con-RAG consistently outperforms prior methods across models, achieving both higher factual accuracy and more consistent responses across paraphrased inputs (see Table \ref{tab:main-results} for full numerical results).
  • Figure 3: Overview of PS-GRPO and Information Consistent RAG (Con-RAG) framework. A canonical query $q$ is expanded into a set of paraphrases $\{p_1,\dots,p_n\}$, each of which is passed through the policy LLM to generate $g$ sampled rollouts. For every rollout $o_{ij}$, we compute a group similarity reward $r_{ij}$ by averaging its similarity with outputs from other paraphrases of the same query (this produces an $n \times g$ reward matrix). Normalized advantages are then computed within each paraphrase set, and the policy model is updated.
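
The reward and advantage computation described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the token-level Jaccard similarity is a stand-in assumption (the caption does not say which similarity function is used), the function names are hypothetical, and normalizing over all $n \times g$ rewards of one query is one reading of "within each paraphrase set". The exact computation shown here compares every rollout against all rollouts of the other paraphrases; the scalable approximation mentioned in the abstract is not shown.

```python
import math
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (assumed stand-in for the paper's similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def group_similarity_rewards(
    rollouts: List[List[str]],
    sim: Callable[[str, str], float] = jaccard,
) -> List[List[float]]:
    """rollouts[i][j] is rollout o_ij for paraphrase p_i.

    Returns an n x g matrix where r_ij is the mean similarity between o_ij
    and every rollout produced for the *other* paraphrases of the same query.
    """
    n = len(rollouts)
    if n < 2:
        raise ValueError("need at least two paraphrases to compare across")
    rewards = []
    for i, row in enumerate(rollouts):
        # Pool the rollouts of all paraphrases except p_i.
        others = [o for k in range(n) if k != i for o in rollouts[k]]
        rewards.append([sum(sim(o, u) for u in others) / len(others) for o in row])
    return rewards


def normalized_advantages(
    rewards: List[List[float]], eps: float = 1e-8
) -> List[List[float]]:
    """Standardize rewards over the whole paraphrase set (all n * g entries)."""
    flat = [r for row in rewards for r in row]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((r - mean) ** 2 for r in flat) / len(flat))
    return [[(r - mean) / (std + eps) for r in row] for row in rewards]


# Toy example: n = 2 paraphrases, g = 2 rollouts each (made-up strings).
example_rollouts = [
    ["paris is the capital", "the capital is paris"],
    ["paris is the capital", "lyon"],
]
example_rewards = group_similarity_rewards(example_rollouts)
example_advantages = normalized_advantages(example_rewards)
```

In this toy case the off-topic rollout ("lyon") shares no tokens with the other paraphrase's outputs, so it receives the lowest reward and a negative advantage, which is the signal that would push the policy toward answers that agree across paraphrases.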