Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

Teng Wang, Zhangyi Jiang, Zhenqi He, Shenyang Tong, Wenhan Yang, Yanan Zheng, Zeyu Li, Zifan He, Hailei Gong, Zewen Ye, Shengjie Ma, Jianping Zhang

TL;DR

To make reasoning in large language models more reliable, the paper introduces the Hierarchical Reward Model (HRM), which is trained on both single-step and short multi-step reasoning sequences so that it can assess coherence and self-correction. It further proposes Hierarchical Node Compression (HNC), which augments data generated by Monte Carlo Tree Search (MCTS) with controlled noise, enabling scalable reward-model training at limited annotation cost. Empirical results on PRM800K show that HRM is more stable than PRM and ORM and generalizes to GSM8K and MATH500; self-training on high-quality trajectories further improves policy performance. The work advances reward modeling by enabling robust evaluation of multi-step reasoning and by offering a practical data-augmentation strategy for large-scale LLM reasoning.

Abstract

Recent studies show that Large Language Models (LLMs) achieve strong reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a key approach, the Process Reward Model (PRM), suffers from reward hacking, making it unreliable in identifying the best intermediate step. In addition, the cost of annotating reasoning processes for reward modeling is high, making large-scale collection of high-quality data challenging. To address these issues, we propose a novel reward model approach called the Hierarchical Reward Model (HRM), which evaluates reasoning at two levels: individual steps (fine-grained) and sequences of consecutive steps (coarse-grained). HRM excels at assessing multi-step reasoning coherence, especially when flawed steps are later corrected through self-reflection. To further reduce the cost of generating training data, we introduce a lightweight and effective data augmentation strategy called Hierarchical Node Compression (HNC), which merges two consecutive reasoning steps into one within the tree structure. By applying HNC to MCTS-generated reasoning trajectories, we enhance the diversity and robustness of HRM training data while introducing controlled noise with minimal computational overhead. Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM. Furthermore, cross-domain evaluations on the MATH500 and GSM8K datasets demonstrate HRM's strong generalization and robustness across a variety of reasoning tasks.
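
To make the two evaluation levels concrete, the following is a minimal sketch of how training samples for such a model might be assembled from one labeled trajectory. It is an illustration under stated assumptions, not the authors' pipeline: the function name `build_hrm_samples`, the `(text, label)` tuple format, and in particular the rule that a merged pair inherits the label of its later step are all hypothetical choices made here. Preceding context such as the question and earlier steps is omitted for brevity.

```python
from typing import List, Tuple

# (step text, label), where 1 = acceptable and 0 = flawed (hypothetical format)
Step = Tuple[str, int]


def build_hrm_samples(steps: List[Step]) -> List[Step]:
    """Return single-step samples plus merged windows of two consecutive steps."""
    # Fine-grained level: every individual step is a sample.
    samples = list(steps)
    # Coarse-grained level: each pair of neighbouring steps becomes one sample.
    for (text_a, _), (text_b, label_b) in zip(steps, steps[1:]):
        # ASSUMPTION: the merged pair takes the label of its later step, so a
        # flawed step followed by a valid correction counts as acceptable.
        samples.append((f"{text_a}\n{text_b}", label_b))
    return samples


trajectory = [
    ("x + 2 = 5, so x = 2", 0),
    ("Check: 2 + 2 = 4, not 5, so x = 3", 1),
    ("Therefore 2x = 6", 1),
]
samples = build_hrm_samples(trajectory)
for text, label in samples:
    print(label, repr(text))
```

Under this labeling assumption, the first merged sample pairs a flawed step with its correction and is marked acceptable, which is the kind of self-corrected span that a purely step-wise reward model would penalize.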

Paper Structure

This paper contains 23 sections, 3 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Illustration of how ORM, PRM, and HRM handle reasoning processes. ORM evaluates the entire reasoning chain, PRM assesses individual steps but stops at errors, and HRM considers multiple consecutive steps, enabling error correction. The figure also demonstrates how HRM constructs its training dataset by merging two consecutive steps.
  • Figure 2: Illustration of the MCTS-based automated reasoning annotation process. The left side depicts a tree structure where each node represents a reasoning step, simulated using the Tree of Thoughts (ToT) approach with MCTS. The right side visualizes the assigned scores for each step in the reasoning tree.
  • Figure 3: Illustration of HNC. The left part represents the original MCTS data annotation structure, while the right part shows the transformed MCTS structure after applying HNC.
  • Figure 4: Loss dynamics during training across different KL loss weightings. Each column corresponds to a different $\lambda$ value: 0.001 (left), 0.5 (middle), and 10.0 (right). The top row shows the log KL loss, while the bottom row depicts the causal language modeling loss.
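
The node-merging idea behind Figure 3 can be sketched on a toy tree. This is one plausible reading of "merging two consecutive reasoning steps into one within the tree structure", not the paper's implementation: the `Node` class, the `compress_child` helper, and the choice to fold a node into each of its children are assumptions made for illustration, and the sketch leaves out how step scores are reassigned afterwards.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One reasoning step in a search tree (hypothetical structure)."""
    step: str
    children: List["Node"] = field(default_factory=list)


def compress_child(parent: Node, index: int) -> Node:
    """Return a copy of `parent` where child `index` is merged with each of its children.

    The original tree is not mutated; untouched subtrees are shared, not copied.
    """
    target = parent.children[index]
    if not target.children:
        # A leaf has no following step to merge with, so it is kept as-is.
        merged = [target]
    else:
        # ASSUMPTION: the merged node concatenates both step texts and keeps
        # the grandchild's own subtree.
        merged = [
            Node(step=f"{target.step} {grandchild.step}", children=grandchild.children)
            for grandchild in target.children
        ]
    new_children = parent.children[:index] + merged + parent.children[index + 1:]
    return Node(step=parent.step, children=new_children)


def depth(node: Node) -> int:
    """Number of nodes on the longest root-to-leaf path."""
    return 1 + max((depth(child) for child in node.children), default=0)


root = Node("Q", [
    Node("s1", [Node("s2a", [Node("s3")]), Node("s2b")]),
    Node("t1"),
])
compressed = compress_child(root, 0)
print([child.step for child in compressed.children])
print(depth(root), depth(compressed))
```

In this sketch the compressed tree is one level shallower along the merged branch, and each merged node covers two consecutive steps, which is the kind of sample a coarse-grained reward model would be trained on.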