Two-way Evidence self-Alignment based Dual-Gated Reasoning Enhancement

Kexin Zhang, Junlan Chen, Daifeng Li, Yuxuan Zhang, Yangyang Feng, Bowen Deng, Weixu Chen

TL;DR

This work tackles knowledge-intensive multi-step reasoning (KIMSR) by introducing ESA-DGR, a unified framework that combines two modules: TW-ESA, which performs two-way alignment between strict rationale extraction and LLM reasoning, and DGR, which gradually fuses the LLM's intrinsic knowledge with external evidence. The RIE component within TW-ESA selects concise, causally relevant evidence, while the dual-gated mechanism balances external evidence against internal knowledge, making reasoning more robust to uncertain evidence and less prone to hallucination. Collaborative training with GRPO and alignment losses improves exact match and F1 scores on HotpotQA, 2WikiMultiHopQA, and MuSiQue, and also improves evidence quality and interpretability. The approach reports state-of-the-art performance with more efficient and targeted evidence use, offering a scalable path toward reliable KIMSR in real-world QA tasks.

Abstract

Large language models (LLMs) encounter difficulties in knowledge-intensive multi-step reasoning (KIMSR) tasks. One challenge is how to effectively extract and represent rationale evidence. Current methods often extract evidence that is semantically relevant but logically irrelevant, resulting in flawed reasoning and inaccurate responses. To address this first challenge, we propose a two-way evidence self-alignment (TW-ESA) module, which uses the mutual alignment between strict reasoning and LLM reasoning to strengthen the LLM's understanding of the causal logic of evidence. Another challenge is how to use the rationale evidence and the LLM's intrinsic knowledge for accurate reasoning when the evidence contains uncertainty. We propose a dual-gated reasoning enhancement (DGR) module that gradually fuses useful LLM knowledge into strict reasoning, enabling the model to reason accurately by focusing on causal elements in the evidence and to exhibit greater robustness. The two modules are collaboratively trained in a unified framework, ESA-DGR. Extensive experiments on three diverse and challenging KIMSR datasets show that ESA-DGR significantly surpasses state-of-the-art LLM-based fine-tuning methods, with average improvements of 4% in exact match (EM) and 5% in F1 score. The implementation code is available at https://anonymous.4open.science/r/ESA-DGR-2BF8.
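
For readers unfamiliar with the two reported metrics, the sketch below shows how exact match (EM) and token-level F1 are commonly computed for extractive QA answers. It assumes the widely used SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); the paper's own evaluation script is not shown here, so the function names and normalization details are illustrative assumptions rather than the authors' implementation.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the paper may normalize differently.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level EM and F1 are then typically the mean of these per-question scores, so an average gain of 4% in EM corresponds to roughly four more exactly matched answers per hundred questions.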

Paper Structure

This paper contains 31 sections, 8 equations, 5 figures, 4 tables, 2 algorithms.

Figures (5)

  • Figure 1: The proposed ESA-DGR model.
  • Figure 2: Visualization of token representations for rationale selection. Blue: correctly predicted rationale tokens; Red: tokens misclassified by SEER (a) and TW-ESA (b). TW-ESA demonstrates better separation between rationale and non-rationale tokens. The corresponding case is detailed in the Appendix.
  • Figure 3: Answer quality and query efficiency comparison between ESA-DGR and Search-o1 on Qwen2.5-7B and LLaMA-8B. ESA-DGR consistently yields better answers (EM/F1) and higher-value queries ($\mathcal{Q}_{\text{avg}}$, $\mathcal{U}_{\mathcal{C}}$).
  • Figure 4: Evidence quality comparison. ESA-DGR outperforms baseline method SEER in $\mathcal{S}_{\text{evidence}}$, demonstrating superior rationale extraction quality.
  • Figure 5: Sensitivity analysis of five loss-related hyperparameters ($\lambda_1$ to $\lambda_5$) on Qwen2.5-7B using the HotpotQA dataset. Each subplot shows the impact of one hyperparameter on answer accuracy (EM/F1) and reasoning faithfulness (Precision/Recall), demonstrating that ESA-DGR achieves stable performance under a range of settings and peaks consistently around $\lambda=0.5$ or $1.0$.