
EvidenceRL: Reinforcing Evidence Consistency for Trustworthy Language Models

J. Ben Tamo, Yuxing Lu, Benoit L. Marteau, Micky C. Nnamdi, May D. Wang

Abstract

Large Language Models (LLMs) are fluent but prone to hallucinations, producing answers that appear plausible yet are unsupported by available evidence. This failure is especially problematic in high-stakes domains where decisions must be justified by verifiable information. We introduce EvidenceRL, a reinforcement learning framework that enforces evidence adherence during training. EvidenceRL scores candidate responses for grounding (entailment with retrieved evidence and context) and correctness (agreement with reference answers) and optimizes the generator using Group Relative Policy Optimization (GRPO). We evaluate across two high-stakes domains, cardiac diagnosis and legal reasoning, where EvidenceRL consistently improves evidence grounding and faithfulness without sacrificing task accuracy. On cardiac diagnosis, F1@3 increases from 37.0 to 54.5 on Llama-3.2-3B while grounding ($G_{\max}@3$) rises from 47.6 to 78.2; hallucinations drop nearly 5$\times$ and evidence-supported diagnoses increase from 31.8% to 61.6%. On legal reasoning, EvidenceRL raises Faithfulness from 32.8% to 67.6% on Llama-3.1-8B, demonstrating consistent behavioral change across domains. Our code is open-sourced at https://github.com/Wizaaard/EvidenceRL.git.

Paper Structure

This paper contains 60 sections, 11 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: EvidenceRL aligns task accuracy with faithful evidence use across domains. Training uses GRPO with rewards for correctness ($r_c$), format ($r_f$), and evidence grounding ($r_g$). Grounding is computed via a Focus–Then–Verify procedure: (1) focused (premise, hypothesis) pairs are constructed by combining an anchor context with individual evidence sections, and (2) each pair is scored by a frozen NLI cross-encoder.
  • Figure 2: At $\tau=0.80$, precision remains high for both approaches. Stable recall and high Cohen’s $\kappa$ indicate a conservative reward signal, with no evidence of proxy hacking by GRPO models.
  • Figure 3: All five metrics shift in the same direction under both evaluators, confirming that grounding improvements reflect genuine evidence use rather than reward model overfitting.
  • Figure 4: SFT yields reasonable accuracy but weak grounding. GRPO with correctness reward ($r_c{+}r_f$) maximizes F1, while adding the grounding reward ($r_g$) substantially improves evidence attribution with only minor accuracy trade-offs.
  • Figure 5: Training reward dynamics across model scales and objectives using MIMIC. We illustrate training progress for Llama-3.1-8B, Llama-3.2-3B, and the Gemma-3 series (4B, 12B, and 27B). Performance is evaluated across three primary reward components: (left) Format Reward ($r_{f}$), measuring adherence to structural constraints; (center) Accuracy Reward ($r_{c}$), assessing the correctness of generated responses; and (right) Grounding Reward ($\tilde{r}_{g}$), quantifying the extent to which outputs are supported by the provided context. Larger model scales generally exhibit higher reward ceilings and more stable convergence across all metrics.
  • ...and 3 more figures
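
The Focus–Then–Verify procedure in the Figure 1 caption and the GRPO objective it feeds can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the max aggregation over evidence sections, the unweighted sum $r_c + r_f + r_g$, and the token-overlap scorer `toy_nli` (a stand-in for the frozen NLI cross-encoder) are all assumptions made for illustration. The advantage computation follows the usual group-relative form: each reward is centered on the group mean and scaled by the group standard deviation.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer type: (premise, hypothesis) -> entailment score in [0, 1].
NLIScorer = Callable[[str, str], float]


def build_focused_pairs(
    anchor: str, sections: Sequence[str], hypothesis: str
) -> List[Tuple[str, str]]:
    """Step 1 (Focus): combine the anchor context with each evidence section."""
    return [(f"{anchor}\n{section}", hypothesis) for section in sections]


def grounding_score(
    anchor: str, sections: Sequence[str], hypothesis: str, nli_score: NLIScorer
) -> float:
    """Step 2 (Verify): score every focused pair and aggregate.

    Taking the maximum over sections is an assumption of this sketch.
    """
    pairs = build_focused_pairs(anchor, sections, hypothesis)
    return max((nli_score(p, h) for p, h in pairs), default=0.0)


def total_reward(r_c: float, r_f: float, r_g: float) -> float:
    """Unweighted sum of the three rewards (the weighting is assumed)."""
    return r_c + r_f + r_g


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Normalize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_nli(premise: str, hypothesis: str) -> float:
    """Stand-in scorer: fraction of hypothesis tokens found in the premise.

    This is NOT entailment; it only makes the sketch runnable.
    """
    h_tokens = set(hypothesis.lower().split())
    p_tokens = set(premise.lower().split())
    return len(h_tokens & p_tokens) / len(h_tokens) if h_tokens else 0.0
```

In practice `toy_nli` would be replaced by a call to a real NLI cross-encoder, and the resulting advantages would weight the policy-gradient update for each sampled response in the group.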