From Facts to Conclusions: Integrating Deductive Reasoning in Retrieval-Augmented LLMs

Shubham Mishra; Samyek Jain; Gorang Mehrishi; Shiv Tiwari; Harsh Sharma; Pratik Narang; Dhruv Kumar

TL;DR

This work tackles the difficulty of retrieval-augmented generation when retrieved evidence conflicts, is outdated, or is subjective. It introduces a reasoning-trace-augmented RAG framework with a three-stage adjudication process and a Conflict-Aware Trust Score (CATS) to supervise and evaluate grounded, conflict-sensitive reasoning. The approach is instantiated by fine-tuning Qwen and Mistral models with QLoRA on a 539-query dataset, achieving substantial gains in end-to-end and oracle settings, especially in answer correctness and behavioral adherence. The authors release their annotated dataset and CATS evaluation pipeline, and discuss future directions toward broader benchmarks and stronger conflict-type supervision. Overall, the work provides a foundation for interpretable, robust RAG systems that reason over conflicting evidence and refuse when warranted.

Abstract

Retrieval-Augmented Generation (RAG) grounds large language models (LLMs) in external evidence, but fails when retrieved sources conflict or contain outdated or subjective information. Prior work addresses these issues independently but lacks unified reasoning supervision. We propose a reasoning-trace-augmented RAG framework that adds structured, interpretable reasoning across three stages: (1) document-level adjudication, (2) conflict analysis, and (3) grounded synthesis, producing citation-linked answers or justified refusals. We also introduce a Conflict-Aware Trust-Score (CATS) pipeline that evaluates groundedness, factual correctness, refusal accuracy, and conflict-behavior alignment using an LLM-as-a-Judge. Our 539-query reasoning dataset and evaluation pipeline establish a foundation for conflict-aware, interpretable RAG systems. Experimental results demonstrate substantial gains over baselines, most notably with Qwen, where Supervised Fine-Tuning improved End-to-End answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722.
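
To make the three stages concrete, the following is a minimal sketch of how a reasoning trace might be represented and how the final stage could choose between a citation-linked answer and a justified refusal. It is not the authors' implementation: the class names, field names, the free-form `conflict_type` string, and the refusal rule (refuse when no document supports an answer) are all assumptions introduced for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DocVerdict:
    """Stage 1 (document-level adjudication): one verdict per retrieved document."""
    doc_id: str
    supports_answer: bool  # hypothetical flag: does this document back the draft answer?
    note: str              # short free-text justification


@dataclass
class ReasoningTrace:
    """Hypothetical container for the output of all three stages."""
    query: str
    verdicts: List[DocVerdict]   # stage 1
    conflict_type: str           # stage 2: free-form label, assumed for this sketch
    answer: Optional[str]        # stage 3: None means the system refused
    citations: List[str] = field(default_factory=list)
    refusal_reason: Optional[str] = None


def synthesize(query: str, verdicts: List[DocVerdict],
               conflict_type: str, draft_answer: str) -> ReasoningTrace:
    """Stage 3 (grounded synthesis) under an assumed rule:
    answer only if at least one document supports the draft, and cite exactly those."""
    supporting = [v.doc_id for v in verdicts if v.supports_answer]
    if not supporting:
        return ReasoningTrace(query, verdicts, conflict_type, None, [],
                              "no retrieved document supports an answer")
    return ReasoningTrace(query, verdicts, conflict_type, draft_answer, supporting)
```

Keeping the per-document verdicts and the conflict label alongside the final answer means each cited document can be traced back to the judgment that admitted it, which is the kind of interpretability the abstract describes.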
