Process Supervision for Chain-of-Thought Reasoning via Monte Carlo Net Information Gain

Corentin Royer, Debarun Bhattacharjya, Gaetano Rossiello, Andrea Giovannini, Mennatallah El-Assady

Abstract

Multi-step reasoning improves the capabilities of large language models (LLMs) but increases the risk of errors propagating through intermediate steps. Process reward models (PRMs) mitigate this by scoring each step individually, enabling fine-grained supervision and improved reliability. Existing methods for training PRMs rely on costly human annotations or computationally intensive automatic labeling. We propose a novel approach that automatically generates step-level labels using information theory. Our method estimates how each reasoning step affects the likelihood of the correct answer, providing a signal of step quality. Importantly, it reduces computational complexity to $\mathcal{O}(N)$, improving on previous $\mathcal{O}(N \log N)$ methods. We demonstrate that these labels enable effective chain-of-thought selection in best-of-$K$ evaluation settings across diverse reasoning benchmarks, including mathematics, Python programming, SQL, and scientific question answering. This work enables scalable and efficient supervision of LLM reasoning, particularly for tasks where error propagation is critical.
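
The idea of scoring a step by how it changes the likelihood of the correct answer can be illustrated with a minimal sketch. This is not the authors' MCNIG procedure, whose Monte Carlo estimation and "net" gain are not specified in this excerpt. It assumes that the log-likelihood of the correct answer has already been computed after each step prefix, and it takes the per-step difference as a plain information-gain signal. The function names, the zero threshold, and the minimum-over-steps aggregation used for best-of-$K$ selection are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def step_information_gains(answer_logprobs: Sequence[float]) -> List[float]:
    """Per-step change in the log-likelihood of the correct answer.

    answer_logprobs[i] is assumed to be log p(correct answer | question,
    first i steps), so index 0 is the value before any reasoning step.
    One value per prefix means the number of likelihood evaluations
    grows linearly with the number of steps.
    """
    return [
        answer_logprobs[i] - answer_logprobs[i - 1]
        for i in range(1, len(answer_logprobs))
    ]


def label_steps(gains: Sequence[float], threshold: float = 0.0) -> List[int]:
    """Binary step labels: 1 if the step raised the answer likelihood.

    The threshold of 0.0 is an illustrative assumption.
    """
    return [1 if g > threshold else 0 for g in gains]


def best_of_k(step_scores_per_candidate: Sequence[Sequence[float]]) -> int:
    """Index of the candidate chain whose weakest step scores highest.

    Aggregating step scores by their minimum is an assumption; other
    aggregations (product, mean, last step) are equally plausible.
    """
    chain_scores = [min(scores) for scores in step_scores_per_candidate]
    return max(range(len(chain_scores)), key=chain_scores.__getitem__)


# Hypothetical log-likelihoods of the correct answer after 0, 1, 2, 3 steps.
logprobs = [-3.0, -2.0, -2.5, -0.5]
gains = step_information_gains(logprobs)
labels = label_steps(gains)

# Hypothetical PRM step scores for two candidate chains of thought.
chosen = best_of_k([[0.9, 0.2, 0.8], [0.7, 0.6, 0.9]])
```

In this toy input the second step lowers the answer likelihood and is labeled 0, while the other two steps are labeled 1. The second candidate is selected because its lowest step score (0.6) exceeds that of the first candidate (0.2).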

Paper Structure (39 sections, 18 equations, 8 figures, 9 tables)

This paper contains 39 sections, 18 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Best-of-K performance on mathematics (MATH, GSM, and AIME), Python (HumanEval and BigCodeBench), and SQL (Bird). The PRM trained on MCNIG labels achieves the highest performance at all values of the number of candidates $K$. The improvement over majority voting increases with increasing $K$.
  • Figure 2: Comparison between the labeling performance of Information Gain and Monte Carlo Net Information Gain in terms of balanced accuracy of the chain-of-thought score of each candidate.
  • Figure 3: Beeswarm charts from the SHAP analysis of the labeled dataset using IG (top) and MCNIG (bottom). They show the impact of three features: the length of the ground-truth (GT) answer, the number of steps, and problem difficulty. Note that IG fails on problems with longer GT answers, whereas MCNIG performs consistently.
  • Figure 4: Distribution of the number of steps per chain of thought, across all datasets combined.
  • Figure 5: Average best-of-K (BoK) performance over the 6 tasks for PRMs trained on different subsets of the training set, ranging from 100k to the full 1.2M samples.
  • ...and 3 more figures