EviNote-RAG: Enhancing RAG Models via Answer-Supportive Evidence Notes

Yuqin Dai, Guoqing Wang, Yuan Wang, Kairan Dou, Kaichen Zhou, Zhanwei Zhang, Shuo Yang, Fei Tang, Jun Yin, Pengyu Zeng, Zhenzhe Ying, Can Yi, Changhua Meng, Yuchen Zhou, Yongliang Shen, Shuai Lu

TL;DR

This work tackles noisy external evidence and error propagation in Retrieval-Augmented Generation by introducing EviNote-RAG, which restructures the retrieval loop into a retrieve–note–answer pipeline. The model generates Supportive-Evidence Notes (SENs) that distill answer-relevant information and annotate key and uncertain content, and an entailment-based Evidence Quality Reward (EQR) encourages notes from which the final answer can be logically derived. Trained end to end with reinforcement learning (GRPO), the framework achieves state-of-the-art results on both in-domain and out-of-domain QA benchmarks while improving training stability and efficiency. The approach offers a principled recipe for combining structured note-taking with reward design to produce more interpretable, faithful, and robust RAG systems.
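The interplay between the note, the EQR, and GRPO can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the `Rollout` class, the substring-based `toy_entails` judge (a real system would use an entailment model), the exact-match answer reward, and the `eqr_weight` of 0.5 are all hypothetical. The only part taken from GRPO as generally described is that rewards are centred and scaled within a group of sampled rollouts.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class Rollout:
    note: str    # Supportive-Evidence Note (SEN) written from the retrieved documents
    answer: str  # final answer produced after the note


def toy_entails(premise: str, hypothesis: str) -> bool:
    # ASSUMPTION: trivial stand-in for an entailment judge (substring match).
    return hypothesis.strip().lower() in premise.lower()


def evidence_quality_reward(note: str, answer: str,
                            entails: Callable[[str, str], bool] = toy_entails) -> float:
    # 1.0 when the note alone is judged sufficient to support the answer, else 0.0.
    return 1.0 if answer.strip() and entails(note, answer) else 0.0


def total_reward(rollout: Rollout, gold: str, eqr_weight: float = 0.5) -> float:
    # ASSUMPTION: exact-match answer reward plus a weighted EQR term.
    correct = 1.0 if rollout.answer.strip().lower() == gold.strip().lower() else 0.0
    return correct + eqr_weight * evidence_quality_reward(rollout.note, rollout.answer)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    # GRPO-style normalisation: centre and scale rewards within one sampled group.
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rollouts for a single question; the notes are illustrative only.
group = [
    Rollout(note="Writer credit: Bob Dylan.", answer="Bob Dylan"),
    Rollout(note="Writer credit: Bob Dylan.", answer="Guns N' Roses"),
    Rollout(note="No useful information found.", answer="Bob Dylan"),
]
rewards = [total_reward(r, gold="Bob Dylan") for r in group]
advantages = group_relative_advantages(rewards)
```

In this toy group, the rollout whose note supports a correct answer receives the largest advantage, a correct answer with an unsupportive note comes second, and the wrong answer comes last, which is the ordering an evidence-quality term is meant to induce.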

Abstract

Retrieval-Augmented Generation (RAG) has advanced open-domain question answering by incorporating external information into model reasoning. However, effectively leveraging external information to enhance reasoning presents the following challenges: (1) low signal-to-noise ratio, where answer-supportive external information is diluted by irrelevant material, and (2) error accumulation, which arises in multi-hop reasoning when incomplete or misleading information is incorporated. To address these challenges, we introduce EviNote-RAG, a framework that follows a retrieve-note-answer workflow. Instead of reasoning directly over raw external information, the model first produces Supportive-Evidence Notes (SENs), which concisely preserve answer-critical information and explicitly mark key and uncertainty information to improve accuracy. We further design an entailment-based Evidence Quality Reward (EQR) to ensure that SENs are logically sufficient to derive the final answer, thereby enhancing SENs' quality. Experiments on both in-domain and out-of-domain QA benchmarks show that EviNote-RAG achieves state-of-the-art performance, improving answer accuracy, training stability, robustness, and efficiency. In particular, it yields relative F1 gains of 20% on HotpotQA (+0.093), 40% on Bamboogle (+0.151), and 91% on 2Wiki (+0.256), benefiting from improvements in the reasoning process.
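As a quick sanity check on the reported gains, the relative and absolute F1 improvements together imply approximate baseline scores, since relative gain = absolute gain / baseline F1. The sketch below back-calculates those values; the results are rough estimates derived from the rounded percentages in the abstract, not figures reported by the authors, and the function name `implied_scores` is hypothetical.

```python
# (relative F1 gain, absolute F1 gain) as stated in the abstract
reported = {
    "HotpotQA": (0.20, 0.093),
    "Bamboogle": (0.40, 0.151),
    "2Wiki": (0.91, 0.256),
}


def implied_scores(relative_gain: float, absolute_gain: float):
    # relative_gain = absolute_gain / baseline  =>  baseline = absolute_gain / relative_gain
    baseline = absolute_gain / relative_gain
    return baseline, baseline + absolute_gain


implied = {name: implied_scores(rel, gain) for name, (rel, gain) in reported.items()}

for name, (baseline, improved) in implied.items():
    print(f"{name}: implied baseline F1 ~ {baseline:.2f}, implied new F1 ~ {improved:.2f}")
```

Because the percentages are rounded, the implied baselines are only accurate to about two decimal places.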

Paper Structure

This paper contains 76 sections, 9 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: EviNote-RAG vs. baselines [song2025r1searcher, jin2025search]: EviNote-RAG distills key information through evidence notes and, guided by an Entailment Judge, ensures that retained content directly supports the answer, thereby mitigating noise and enhancing performance.
  • Figure 2:
  • Figure 3: Training dynamics illustrating (a) reward, (b) KL loss, (c) response length, and (d) total time per token (TPT). (e) Ablation study on EQR.
  • Figure 4: Case study on the query "who wrote Knockin' on Heaven's Door?". The baseline model is misled by distracting context (Doc 2 repeatedly frames the song as a Guns N’ Roses piece) and returns the incorrect answer "Guns N’ Roses". In contrast, our EviNote-RAG model filters out the misleading signals, emphasizes key evidence (e.g., the writer credit in Doc 1 and Doc 2), and produces the correct answer "Bob Dylan". This highlights the importance of mitigating interference from false or misleading information in knowledge-intensive tasks.
  • Figure 5: Training dynamics under different summary strategies. (a) actor entropy loss; (b) actor KL loss (w.r.t. the reference policy); (c) mean response length; (d) token-level latency (ms/token). SEN maintains low entropy and KL drift with stable, shorter responses and low latency; NS is slightly less stable but similar in trend; FS achieves low latency at the cost of under-exploration and weaker accuracy; the Base policy exhibits late-stage blow-up in KL/entropy, response-length sprawl, and higher per-token latency.
  • ...and 4 more figures