Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization
Jiaqi Wei, Hao Zhou, Xiang Zhang, Di Zhang, Zijie Qiu, Wei Wei, Jinzhe Li, Wanli Ouyang, Siqi Sun
TL;DR
This work reframes Retrieval-Augmented Generation as Retrieval-Augmented Reasoning and tackles reasoning misalignment between a model's internal reasoning and retrieved evidence. It introduces Critique-Driven Alignment (CDA) and a retrieval-augmented Critic Language Model (CLM) trained via Contrastive Critique Synthesis to iteratively align reasoning with evidence. AlignRAG and its autonomous variant AlignRAG-auto dynamically refine responses, achieving state-of-the-art results across seven QA benchmarks and three model families, including strong robustness to noisy retrieval. The approach is plug-and-play, scalable at test time, and improves out-of-domain generalization while reducing computation through dynamic stopping in AlignRAG-auto.
Abstract
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs). However, standard RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions. In this work, we reinterpret RAG as Retrieval-Augmented Reasoning and identify a central but underexplored problem: Reasoning Misalignment, the divergence between an LLM's internal reasoning trajectory and the evidential constraints provided by retrieval. To address this issue, we propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA). We further introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations. At the heart of AlignRAG lies a contrastive critique synthesis mechanism that generates retrieval-sensitive critiques while mitigating self-bias. This mechanism trains a dedicated retrieval-augmented Critic Language Model (CLM) using labeled critiques that distinguish between evidence-aligned and misaligned reasoning. Empirical evaluations show that our approach significantly improves reasoning fidelity. Our 8B-parameter CLM improves performance over the Self-Refine baseline by 12.1% on out-of-domain tasks and outperforms a standard 72B-parameter CLM by 2.2%. Furthermore, AlignRAG-auto achieves this state-of-the-art performance while dynamically determining the optimal number of refinement steps, enhancing efficiency and usability. AlignRAG remains compatible with existing RAG architectures as a plug-and-play module and demonstrates strong robustness under both informative and noisy retrieval scenarios.
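
To make the iterative structure described above concrete, the following is a minimal sketch of a critique-driven refinement loop with an optional dynamic stopping rule. It is not the authors' implementation: the names `Critique`, `critique_driven_refine`, `max_steps`, and `auto_stop`, the `aligned` flag, and the toy generator, critic, and refiner are all assumptions made for illustration. In a real system the three callables would presumably wrap a generator LLM and a trained CLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Critique:
    aligned: bool   # hypothetical flag: critic judges the response consistent with the evidence
    feedback: str   # natural-language critique handed back for refinement


def critique_driven_refine(
    query: str,
    docs: List[str],
    generate: Callable[[str, List[str]], str],
    critique: Callable[[str, List[str], str], Critique],
    refine: Callable[[str, List[str], str, str], str],
    max_steps: int = 3,
    auto_stop: bool = True,
) -> Tuple[str, int]:
    """Return the final response and the number of refinement steps applied.

    With auto_stop=False, exactly max_steps refinements run (a fixed budget).
    With auto_stop=True, the loop ends as soon as the critic reports alignment,
    and max_steps acts only as a safety cap.
    """
    response = generate(query, docs)
    steps = 0
    for _ in range(max_steps):
        verdict = critique(query, docs, response)
        if auto_stop and verdict.aligned:
            break  # dynamic termination: no further critique needed
        response = refine(query, docs, response, verdict.feedback)
        steps += 1
    return response, steps


# Toy stand-ins (assumptions) so the sketch runs without any model.
def toy_generate(query: str, docs: List[str]) -> str:
    return "Berlin"  # deliberately contradicts the retrieved evidence


def toy_critique(query: str, docs: List[str], response: str) -> Critique:
    supported = any(response in d for d in docs)
    return Critique(supported, "" if supported else "Answer is not supported by the evidence.")


def toy_refine(query: str, docs: List[str], response: str, feedback: str) -> str:
    # Only change the answer when the critic actually raised an issue.
    return "Paris" if feedback else response


docs = ["Paris is the capital of France."]
question = "What is the capital of France?"

auto_answer, auto_steps = critique_driven_refine(
    question, docs, toy_generate, toy_critique, toy_refine
)
fixed_answer, fixed_steps = critique_driven_refine(
    question, docs, toy_generate, toy_critique, toy_refine, auto_stop=False
)
print(auto_answer, auto_steps, fixed_answer, fixed_steps)
```

In this toy run both modes reach the same answer, but the auto-stopping variant uses fewer refinement steps, which is the kind of saving the abstract attributes to dynamic termination. Keeping the critic as a separate callable also reflects the plug-and-play claim: the loop does not depend on how the generator or retriever is built.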
