DPRM: A Dual Implicit Process Reward Model in Multi-Hop Question Answering

Xinyi Wang; Yiping Song; Zhiliang Tian; Bo Liu; Tingjin Luo; Minlie Huang

TL;DR

The paper addresses multi-hop question answering (MHQA) by integrating Chain of Thought (CoT) with Knowledge Graph (KG) reasoning through a dual implicit process reward model (DPRM). It introduces two implicit PRMs, CoT-PRM and KG-PRM, both trained from outcome signals; KG-PRM uses preference pairs to impose graph-structure constraints, and a consistency mechanism aligns the two PRMs. The framework enables iterative, self-correcting reasoning in which KG paths and CoT steps mutually reinforce each other, without requiring explicit step-level annotations. Empirical results on WebQSP and CWQ show state-of-the-art performance, with notable gains in Hit@1, especially on the more challenging CWQ, supporting the effectiveness of cross-modal reward-driven reasoning in MHQA.

Abstract

In multi-hop question answering (MHQA) tasks, Chain of Thought (CoT) improves the quality of generation by guiding large language models (LLMs) through multi-step reasoning, and Knowledge Graphs (KGs) reduce hallucinations via semantic matching. Outcome Reward Models (ORMs) provide feedback after the final answer is generated but fail to evaluate the intermediate steps of multi-step reasoning. Traditional Process Reward Models (PRMs) evaluate the reasoning process but require costly human annotations or rollout generation. In contrast, an implicit PRM is trained only with outcome signals and derives step rewards through reward parameterization without explicit annotations, which makes it more suitable for multi-step reasoning in MHQA tasks. However, existing implicit PRMs have only been explored in plain-text scenarios. When adapted to MHQA tasks, they can neither handle the graph-structure constraints in KGs nor capture potential inconsistencies between CoT and KG paths. To address these limitations, we propose DPRM (Dual Implicit Process Reward Model), which trains two implicit PRMs for CoT and KG reasoning in MHQA tasks. Both PRMs, namely KG-PRM and CoT-PRM, derive step-level rewards from outcome signals via reward parameterization without additional explicit annotations. Of the two, KG-PRM uses preference pairs to learn structural constraints from KGs. DPRM further introduces a consistency constraint between CoT and KG reasoning steps, so that the two PRMs mutually verify and collaboratively optimize the reasoning paths. We also provide a theoretical derivation of the process rewards. Experimental results show that our method outperforms 13 baselines on multiple datasets, with up to 16.6% improvement on Hit@1.
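To make "step rewards through reward parameterization" concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. It assumes the log-likelihood-ratio parameterization commonly used for implicit PRMs in prior work, where the outcome reward of a response is `beta * log(pi(y) / pi_ref(y))` and each step's reward is the share of that sum contributed by the step's tokens. The function name `implicit_step_rewards`, its arguments, and the value of `beta` are hypothetical; the abstract does not specify how DPRM's KG-PRM or CoT-PRM compute these quantities.

```python
from typing import List, Sequence


def implicit_step_rewards(
    policy_logps: Sequence[float],
    ref_logps: Sequence[float],
    step_ends: Sequence[int],
    beta: float = 0.05,
) -> List[float]:
    """Split an outcome-level log-ratio reward into per-step rewards.

    policy_logps / ref_logps: per-token log-probabilities of the same
        response under the trained model and a frozen reference model
        (assumed inputs, hypothetical names).
    step_ends: exclusive token index at which each reasoning step ends.
    beta: scaling coefficient (assumed hyperparameter).
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference log-probs must align token by token")

    rewards: List[float] = []
    start = 0
    for end in step_ends:
        # Reward of a step = beta * sum of token-level log-ratios inside it.
        log_ratio = sum(
            p - r for p, r in zip(policy_logps[start:end], ref_logps[start:end])
        )
        rewards.append(beta * log_ratio)
        start = end
    return rewards


# Toy example: a 4-token response split into two 2-token steps.
example_rewards = implicit_step_rewards(
    policy_logps=[-1.0, -2.0, -0.5, -0.5],
    ref_logps=[-1.5, -2.5, -0.5, -1.5],
    step_ends=[2, 4],
    beta=1.0,
)
```

Under this parameterization the step rewards sum to the outcome-level reward, which is why only outcome labels are needed for training: no step is annotated directly, yet each step still receives a score.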
