SEER: Facilitating Structured Reasoning and Explanation via Reinforcement Learning

Guoxin Chen; Kexin Tang; Chao Yang; Fuying Ye; Yu Qiao; Yiming Qian

TL;DR

This paper proposes SEER, a novel method that maximizes a structure-based return to facilitate structured reasoning and explanation, and introduces a fine-grained reward function to meticulously delineate diverse reasoning steps.

Abstract

Elucidating the reasoning process with structured explanations from question to answer is crucial, as it significantly enhances the interpretability, traceability, and trustworthiness of question-answering (QA) systems. However, structured explanations require models to perform intricate structured reasoning, which poses great challenges. Most existing methods focus on single-step reasoning through supervised learning, ignoring logical dependencies between steps. Moreover, existing reinforcement learning (RL) based methods overlook the structured relationships between steps, underutilizing the potential of RL in structured reasoning. In this paper, we propose SEER, a novel method that maximizes a structure-based return to facilitate structured reasoning and explanation. Our proposed structure-based return precisely describes the hierarchical and branching structure inherent in structured reasoning, effectively capturing the intricate relationships between different reasoning steps. In addition, we introduce a fine-grained reward function to meticulously delineate diverse reasoning steps. Extensive experiments show that SEER significantly outperforms state-of-the-art methods, achieving an absolute improvement of 6.9% over RL-based methods on EntailmentBank and a 4.4% average improvement on the STREET benchmark, while exhibiting outstanding efficiency and cross-dataset generalization performance. Our code is available at https://github.com/Chen-GX/SEER.

Paper Structure (53 sections, 11 equations, 10 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Explanation for Question Answering
Natural Language Reasoning
Method
Task Formulation
Overview
Fine-grained Component of SEER
State
Action
Policy
Entailment Module
Reward
Critic
Optimization
...and 38 more sections

Figures (10)

Figure 1: An example of structured explanation. Given a hypothesis $h$ (a declarative sentence derived from a question-answer pair) and a set of facts (or corpus), the goal is to generate a structured explanation, which delineates the reasoning process from facts to the hypothesis.
Figure 2: Overall framework of SEER. For trajectory rollout, action generation (Policy) and conclusion generation (Entailment) are performed alternately. The orange area details the reasoning process from $s_t$ to $s_{t+1}$. For policy optimization, the reward module assigns rewards and updates the policy and critic based on tree or graph structures.
Figure 3: Parameter sensitivity analysis.
Figure 4: An illustration of the reasoning process of SEER. Note that $a_1$ is a correct step, $a_2$ and $a_4$ are erroneous steps, and $a_3$ is a redundant step. We start from the initial state $s_1$, where the set of existing entailment steps is $P_1=\varnothing$ and the set of candidate sentences is $C_1=X$. In each step, we sample an action and update the state until the reasoning is done. For the "Reason" action, we send the premises to the entailment module. The new conclusion is added to $C$, the premises are removed from $C$, and the entailment step is added to $P$. For the "End" action, we end the reasoning process and output the trajectory.
Figure 5: An illustration of the reward and alignment process of SEER. Each reasoning step is a subtree (similarly, each reasoning step is a subgraph in the reasoning graph of Ribeiro et al., ICLR 2023). (1) We construct $T_{\text{pred}}$ using the last intermediate conclusion ($i_4$ in this example) as the hypothesis. (2) We calculate the Jaccard similarity between each intermediate node ($i_*$) in $T_{\text{pred}}$ and each golden intermediate node in $T_{\text{gold}}$ ($\hat{i}_1$ and $h$ in this example), and align each node with the golden node of maximum Jaccard similarity. In this example, $i_1$ is aligned with $\hat{i}_1$ because $\text{JS}(i_1, \hat{i}_1) = 1$. $i_2$ is aligned with "NULL". $i_4$ is aligned with $\hat{i}_1$ because $\text{JS}(i_4, \hat{i}_1)=0.5$ exceeds $\text{JS}(i_4, h)=0.4$. (3) We assign rewards based on the alignment results. Note that $i_3$ ($s_3$) is a redundant step. The resulting rewards are $r_{1}=1$, $r_{2}=-1$, $r_{3}=-0.5$, and $r_{4}=-1$. The reward for each state originates from the tree structure rather than the chained trajectory. Therefore, the return of each state should also follow the tree structure (or the graph structure in reasoning graphs) rather than the chained trajectory.
...and 5 more figures
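
The state update in Figure 4 and the alignment and reward steps in Figure 5 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `State` class, the function names, and the toy leaf sets are hypothetical, and it assumes that each intermediate node is represented by the set of leaf facts beneath it when the Jaccard similarity is computed. The reward rule (1 for an exact match, -0.5 for a redundant step outside $T_{\text{pred}}$, -1 otherwise) is inferred from the values listed in the Figure 5 caption only.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Hypothetical container for s_t: entailment steps P and candidates C."""
    steps: list = field(default_factory=list)     # P: (premises, conclusion) pairs
    candidates: set = field(default_factory=set)  # C: sentence ids still usable


def reason(state, premises, conclusion):
    """'Reason' action: premises leave C, the conclusion joins C, the step joins P."""
    premises = frozenset(premises)
    if not premises <= state.candidates:
        raise ValueError("premises must be drawn from the candidate set")
    new_candidates = (state.candidates - premises) | {conclusion}
    return State(state.steps + [(premises, conclusion)], new_candidates)


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two collections, 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align(pred_leaves, gold_nodes):
    """Return (gold name, score) with maximum Jaccard, or ("NULL", 0.0) if nothing overlaps."""
    best_name, best_score = "NULL", 0.0
    for name, leaves in gold_nodes.items():
        score = jaccard(pred_leaves, leaves)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


def reward(score, in_pred_tree):
    """Reward rule inferred from the Figure 5 example (an assumption, not the paper's definition)."""
    if not in_pred_tree:
        return -0.5  # redundant step: not part of T_pred
    return 1.0 if score == 1.0 else -1.0


# Toy leaf sets (invented) chosen to reproduce the similarities quoted in Figure 5.
gold = {"i_hat_1": {"x1", "x2"}, "h": {"x1", "x2", "x3"}}
pred = {"i1": {"x1", "x2"}, "i2": {"x4", "x5"}, "i4": {"x1", "x2", "x4", "x5"}}

alignment = {name: align(leaves, gold) for name, leaves in pred.items()}
rewards = {name: reward(score, True) for name, (_, score) in alignment.items()}
rewards["i3"] = reward(0.0, False)  # i3 is the redundant step outside T_pred

# One 'Reason' step from the initial state s_1, where P_1 is empty and C_1 = X.
s1 = State(candidates={"x1", "x2", "x3", "x4", "x5"})
s2 = reason(s1, {"x1", "x2"}, "i1")
```

With these toy sets, `i1` aligns with `i_hat_1`, `i2` aligns with "NULL", and `i4` aligns with `i_hat_1` rather than `h`, mirroring the alignment described in the caption.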
