Reward Model Generalization for Compute-Aware Test-Time Reasoning

Zeen Song, Wenwen Qiang, Siyu Zhao, Changwen Zheng, Gang Hua

TL;DR

The paper studies how the generalization ability of a Process Reward Model (PRM) affects compute-optimal external test-time reasoning (TTS) in large language models. It derives PAC-Bayes-based generalization bounds and connects them to final answer accuracy and compute budget, highlighting the risk of mis-ranking candidate reasoning paths due to reward prediction error. Motivated by these insights, it proposes Compute-Aware Tree Search (CATS), an A2C-based controller that dynamically allocates compute by balancing compute cost, reward margins, and PRM scores, using sparsity as a proxy for generalization. Empirical results on MATH-500 and AIME24 across multiple policy models and PRMs show that CATS consistently outperforms standard external TTS methods, validating the theoretical predictions and demonstrating practical gains in compute efficiency and accuracy.

Abstract

External test-time reasoning enhances large language models (LLMs) by decoupling generation and selection. At inference time, the model generates multiple reasoning paths, and an auxiliary process reward model (PRM) is used to score and select the best one. A central challenge in this setting is test-time compute optimality (TCO), i.e., how to maximize answer accuracy under a fixed inference budget. In this work, we establish a theoretical framework to analyze how the generalization error of the PRM affects compute efficiency and reasoning performance. Leveraging PAC-Bayes theory, we derive generalization bounds and show that a lower generalization error of PRM leads to fewer samples required to find correct answers. Motivated by this analysis, we propose Compute-Aware Tree Search (CATS), an actor-critic framework that dynamically controls search behavior. The actor outputs sampling hyperparameters based on reward distributions and sparsity statistics, while the critic estimates their utility to guide budget allocation. Experiments on the MATH and AIME benchmarks with various LLMs and PRMs demonstrate that CATS consistently outperforms other external TTS methods, validating our theoretical predictions.
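The generate-then-select loop described above, and the way reward prediction error can mis-rank candidates, can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's method or CATS: the function name `simulate_best_of_n`, the additive Gaussian noise model for the PRM, and all parameter values are hypothetical choices made for illustration.

```python
import random


def simulate_best_of_n(n_samples, p_correct, noise_std, trials=2000, seed=0):
    """Estimate answer accuracy of best-of-N selection with a noisy scorer.

    Assumptions (illustrative, not from the paper):
    - each sampled reasoning path is correct independently with prob. p_correct
    - the hypothetical PRM score is 1.0 for a correct path and 0.0 for an
      incorrect one, plus Gaussian noise with standard deviation noise_std
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Generation step: sample N candidate paths (only correctness is tracked).
        candidates = [rng.random() < p_correct for _ in range(n_samples)]
        # Selection step: score each candidate and keep the top-scoring one.
        scored = [(float(c) + rng.gauss(0.0, noise_std), c) for c in candidates]
        _, chosen_is_correct = max(scored, key=lambda pair: pair[0])
        hits += chosen_is_correct
    return hits / trials


# Same sampling budget, two scorers: an exact one and a noisy one.
acc_exact = simulate_best_of_n(n_samples=8, p_correct=0.3, noise_std=0.0)
acc_noisy = simulate_best_of_n(n_samples=8, p_correct=0.3, noise_std=1.0)
print(f"exact scorer: {acc_exact:.3f}, noisy scorer: {acc_noisy:.3f}")
```

With an exact scorer, accuracy approaches the chance that at least one of the N samples is correct, which is $1 - (1 - p)^N$. As the scoring noise grows, incorrect paths are more often ranked first, so more samples are needed to reach the same accuracy. This is the qualitative effect the abstract attributes to PRM generalization error.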

Paper Structure

This paper contains 33 sections, 6 theorems, 33 equations, 4 figures, 6 tables, 2 algorithms.

Key Result

Theorem 4.2

Let $P, Q \in \mathcal{P}(\Phi)$ be any prior and posterior distributions over $\phi$, and let $\ell$ be a bounded loss function taking values in $[0,1]$. Then, for any $\delta \in (0,1]$, with probability at least $1 - \delta$ over the choice of training set $\mathcal{S} \sim \mathcal{D}^n$, the fo

Figures (4)

  • Figure 1: The comparison results on the MATH-500 dataset for different policy models and PRMs.
  • Figure 2: The comparison results on the AIME24 dataset for different policy models and PRMs.
  • Figure 3: Full results on the MATH-500 dataset.
  • Figure 4: Full results on the AIME24 dataset.

Theorems & Definitions (9)

  • Theorem 4.2: PAC-Bayes Generalization Bound for PRMs
  • Theorem 4.5: Answer–Accuracy Bound with Reward‐Gap
  • Corollary 4.6: Target Accuracy Constraint on Sampling and Margin
  • Theorem B.1: PAC-Bayes Generalization Bound for Reward Models
  • proof
  • Theorem C.1: Answer–Accuracy Bound with Reward‐Gap Parameter
  • proof
  • Corollary D.1: Target Accuracy Constraint on Sampling and Margin
  • proof