DPRM: A Dual Implicit Process Reward Model in Multi-Hop Question Answering

Xinyi Wang, Yiping Song, Zhiliang Tian, Bo Liu, Tingjin Luo, Minlie Huang

TL;DR

The paper addresses multi-hop question answering (MHQA) by integrating Chain of Thought (CoT) with Knowledge Graph (KG) reasoning through a dual implicit process reward model (DPRM). It introduces two implicit PRMs, CoT-PRM and KG-PRM, both trained from outcome signals; KG-PRM uses preference pairs to impose graph-structure constraints, and a consistency mechanism aligns the two PRMs. The framework enables iterative, self-correcting reasoning in which KG paths and CoT steps mutually reinforce each other, without requiring explicit step-level annotations. Empirical results on WebQSP and CWQ show state-of-the-art performance, with notable gains in Hit@1, especially on the more challenging CWQ, supporting the effectiveness of cross-modal, reward-driven reasoning in MHQA.

Abstract

In multi-hop question answering (MHQA) tasks, Chain of Thought (CoT) improves the quality of generation by guiding large language models (LLMs) through multi-step reasoning, and Knowledge Graphs (KGs) reduce hallucinations via semantic matching. Outcome Reward Models (ORMs) provide feedback after the final answer is generated but fail to evaluate the multi-step reasoning process. Traditional Process Reward Models (PRMs) evaluate the reasoning process but require costly human annotations or rollout generation. Implicit PRMs, by contrast, are trained only with outcome signals and derive step rewards through reward parameterization without explicit annotations, which makes them more suitable for multi-step reasoning in MHQA tasks. However, existing implicit PRMs have only been explored in plain-text scenarios. When adapted to MHQA tasks, they can neither handle the graph-structure constraints in KGs nor capture the potential inconsistency between CoT and KG paths. To address these limitations, we propose DPRM (Dual Implicit Process Reward Model). It trains two implicit PRMs for CoT and KG reasoning in MHQA tasks. Both PRMs, namely KG-PRM and CoT-PRM, derive step-level rewards from outcome signals via reward parameterization without additional explicit annotations. Of the two, KG-PRM uses preference pairs to learn structural constraints from KGs. DPRM further introduces a consistency constraint between CoT and KG reasoning steps, so that the two PRMs mutually verify and collaboratively optimize the reasoning paths. We also provide a theoretical demonstration of the derivation of process rewards. Experimental results show that our method outperforms 13 baselines on multiple datasets with up to 16.6% improvement on Hit@1.

Paper Structure

This paper contains 35 sections, 2 theorems, 14 equations, 15 figures, 4 tables, 1 algorithm.

Key Result

Proposition 3.1

(Proof in Appendix A) Consider an ORM where the reward is $r_{\theta}(y) = \beta \log \frac{\pi_{\theta}(y)}{\pi_{\mathrm{ref}}(y)}$. Define $q_{\theta}^{t}(y_{<t}, y_{t}) := \sum_{i=1}^{t} \beta \log \frac{\pi_{\theta}(y_{i} \mid y_{<i})}{\pi_{\mathrm{ref}}(y_{i} \mid y_{<i})}$. [...] Hence, $q_\theta^t$ represents an exact expectation of outcome reward $r_\theta$ at step $t$.
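
The definition of $q_\theta^t$ in Proposition 3.1 is a running sum of scaled log-likelihood ratios, so it can be computed directly from per-step log-probabilities. The following is a minimal sketch, not the authors' implementation. It assumes each $y_i$ is one reasoning step (or token) whose log-probability under the trained model and the reference model is already available, and it assumes the step reward is taken as the difference $q_\theta^t - q_\theta^{t-1}$, as is common in the implicit-PRM formulation. The function name, argument names, and the default `beta` are hypothetical.

```python
from typing import List, Tuple


def implicit_process_rewards(
    logp_policy: List[float],
    logp_ref: List[float],
    beta: float = 0.05,  # hypothetical default; the source does not give a value
) -> Tuple[List[float], List[float]]:
    """Compute q^t and per-step rewards from per-step log-probabilities.

    logp_policy[i] is log pi_theta(y_i | y_<i); logp_ref[i] is the same
    quantity under the reference model pi_ref.
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("both sequences must have the same number of steps")

    # q^t = sum_{i<=t} beta * log(pi_theta(y_i | y_<i) / pi_ref(y_i | y_<i))
    q_values: List[float] = []
    running = 0.0
    for lp, lr in zip(logp_policy, logp_ref):
        running += beta * (lp - lr)
        q_values.append(running)

    # Assumed step reward: r^t = q^t - q^{t-1}, with q^0 = 0.
    step_rewards: List[float] = []
    previous = 0.0
    for q in q_values:
        step_rewards.append(q - previous)
        previous = q

    return q_values, step_rewards


if __name__ == "__main__":
    # Toy log-probabilities for a three-step response (illustrative only).
    q, r = implicit_process_rewards(
        [-1.0, -2.0, -0.5], [-1.5, -1.0, -0.5], beta=1.0
    )
    print("q values:", q)
    print("step rewards:", r)
```

In this sketch the last entry of `q_values` equals the outcome reward $r_\theta(y)$ for the full response, because the per-step log-ratios sum to the log-ratio of the whole sequence. This is why no step-level labels are needed: only the outcome-level reward model is trained, and the step values fall out of its parameterization.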

Figures (15)

  • Figure 1: The overview of DPRM. (a) Dual PRM Training trains CoT-PRM and KG-PRM with outcome signals. KG-PRM uses preference pairs to learn structural constraints. Co-training of both PRMs makes them mutually verify and collaboratively optimize the KG paths and CoTs. (b) Iterative Reasoning contains 4 parts: ① KG Path and CoT Initialization, ② KG triple screening and reconstruction (use KG-PRM), ③ CoT step generation (use CoT-PRM), and ④ Final Answer Generation.
  • Figure 2: True samples and false samples.
  • Figure 3: Triple reconstruction for entity consistency.
  • Figure 4: Performances on the CoT-PRMs trained on original and KG-derived data.
  • Figure 5: Proportion of Same and Different Entities/Relations in CoT-Derived KG Data.
  • ...and 10 more figures

Theorems & Definitions (3)

  • Proposition 3.1
  • Proposition A.1
  • proof