
Adaptive Uncertainty-Aware Tree Search for Robust Reasoning

Zeen Song, Zihao Ma, Wenwen Qiang, Changwen Zheng, Gang Hua

TL;DR

Uncertainty-Aware Tree Search (UATS) is a unified method that estimates Process Reward Model (PRM) uncertainty via Monte Carlo Dropout and dynamically allocates compute budget with a reinforcement learning-based controller, mitigating the impact of out-of-distribution (OOD) errors.

Abstract

Inference-time reasoning scaling has significantly advanced the capabilities of Large Language Models (LLMs) in complex problem-solving. A prevalent approach involves external search guided by Process Reward Models (PRMs). However, a fundamental limitation of this framework is the epistemic uncertainty of PRMs when evaluating reasoning paths that deviate from their training distribution. In this work, we conduct a systematic analysis of this challenge. We first provide empirical evidence that PRMs exhibit high uncertainty and unreliable scoring on out-of-distribution (OOD) samples. We then establish a theoretical framework proving that while standard search incurs linear regret accumulation, an uncertainty-aware strategy can achieve sublinear regret. Motivated by these findings, we propose Uncertainty-Aware Tree Search (UATS), a unified method that estimates uncertainty via Monte Carlo Dropout and dynamically allocates compute budget using a reinforcement learning-based controller. Extensive experiments demonstrate that our approach effectively mitigates the impact of OOD errors.
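
The Monte Carlo Dropout estimate mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy linear reward head, the function names `toy_prm_score` and `mc_dropout_estimate`, and all default values are assumptions made for the example. The idea is to keep dropout active at inference, score the same reasoning step several times, and treat the variance of the scores as a proxy for epistemic uncertainty.

```python
import math
import random
import statistics


def toy_prm_score(features, weights, drop_rate, rng):
    """One stochastic forward pass of a toy (hypothetical) reward head.

    Each input feature is dropped with probability `drop_rate`; surviving
    features are rescaled by 1 / (1 - drop_rate) (inverted dropout).
    """
    keep = 1.0 - drop_rate
    logit = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:
            logit += w * x / keep
    return 1.0 / (1.0 + math.exp(-logit))  # squash to a reward in (0, 1)


def mc_dropout_estimate(features, weights, drop_rate=0.1, n_passes=32, seed=0):
    """Return (mean score, score variance) over `n_passes` dropout passes."""
    rng = random.Random(seed)
    scores = [
        toy_prm_score(features, weights, drop_rate, rng)
        for _ in range(n_passes)
    ]
    return statistics.fmean(scores), statistics.pvariance(scores)


# Example call on made-up features and weights.
mean_score, score_variance = mc_dropout_estimate(
    [1.0, -0.5, 2.0], [0.8, 0.3, -0.2], drop_rate=0.2, n_passes=64, seed=1
)
```

With a real PRM, the stochastic passes would come from the network's own dropout layers rather than from a hand-written mask; the mean and variance computation stays the same.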

Paper Structure

This paper contains 33 sections, 2 theorems, 23 equations, 12 figures, and 4 tables.

Key Result

Proposition 4.1

Consider the above scenario. Suppose (i) $\mathbb{P}(\mathcal{O}_t)=\varepsilon$ for all $t$; (ii) under a fixed continuation policy, if $\hat{h}_t\neq h_t^*$ then the final success probability decreases by at least $\underline{\Delta}$; and (iii) there exists a lower bound $\rho\in(0,1]$ such that where $\mathrm{Acc}_T$ is defined as $\mathrm{Acc}_T \triangleq R^*(\hat{h}_T)$.

Figures (12)

  • Figure 1: Visualization of a real beam search step on the MATH dataset. The diagram depicts a selection conflict where the PRM assigns a higher reward ($0.85$) to an incorrect reasoning step (left) containing a calculation error than to a correct step ($0.76$, right). Despite the high reward, the epistemic uncertainty (visualized via boxplots) reveals the unreliability of the incorrect node, which exhibits significantly higher score variance ($\sigma^2=0.032$) compared to the correct node ($\sigma^2=0.001$).
  • Figure 2: (a) Search accuracy across different PRM-Policy pairs. (b) Distribution of score variance under Monte Carlo Dropout for different PRM-Policy pairs.
  • Figure 3: Overview of the proposed Uncertainty-Aware Tree Search (UATS) framework. The pipeline consists of three phases: (1) Approximate the PRM's uncertainty using Monte Carlo Dropout. (2) Use uncertainty thresholds ($\tau$) and optimism margins ($\delta$) to selectively re-evaluate ambiguous nodes and allocate expansion budgets. (3) Observe the search state $s_t$ and dynamically output action vectors $a_t$ to modulate budget factors and temperature parameters.
  • Figure 4: Accuracy comparison on MATH-500 across different Policy Model and PRM combinations. The x-axis represents the number of candidate paths ($N$) on a logarithmic scale. Our methods, H-UATS and A-UATS, consistently outperform standard baselines.
  • Figure 5: Ablation studies on: (a) Dropout rate. (b) Initial Sampling Count for uncertainty estimation.
  • ...and 7 more figures
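
The selection conflict in Figure 1 can be reproduced numerically. The sketch below uses the rewards and variances reported in that caption and contrasts greedy selection by mean reward with a variance-penalized (lower-confidence-bound) ranking. The penalized rule, the `penalty` weight, and the threshold value `TAU` are illustrative assumptions, not the selection rule defined in the paper; they only show how an uncertainty signal can reverse a misleading ranking.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    label: str
    mean_reward: float
    variance: float


def needs_reevaluation(node, tau):
    """Flag a node whose score variance exceeds the threshold tau."""
    return node.variance > tau


def pessimistic_score(node, penalty=1.0):
    """Mean reward minus `penalty` standard deviations (hypothetical rule)."""
    return node.mean_reward - penalty * math.sqrt(node.variance)


# Values taken from the Figure 1 caption.
candidates = [
    Node("incorrect step", mean_reward=0.85, variance=0.032),
    Node("correct step", mean_reward=0.76, variance=0.001),
]

TAU = 0.01  # hypothetical uncertainty threshold

greedy_pick = max(candidates, key=lambda n: n.mean_reward)
robust_pick = max(candidates, key=pessimistic_score)
flagged = [n.label for n in candidates if needs_reevaluation(n, TAU)]
```

Here the greedy rule selects the incorrect step, while the penalized score (about 0.67 for the incorrect step versus about 0.73 for the correct one) prefers the correct step, and only the incorrect step is flagged for re-evaluation.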

Theorems & Definitions (4)

  • Proposition 4.1: Linear degradation under epistemic uncertainty
  • Proposition 4.2: Sublinear degradation under uncertainty-aware selection
  • proof
  • proof
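
The difference between linear and sublinear degradation named in the two propositions can be illustrated with a small calculation. This is a sketch under stated assumptions, not the paper's bounds: `EPSILON` is a hypothetical constant per-step error probability (mirroring a fixed $\mathbb{P}(\mathcal{O}_t)=\varepsilon$), and the $1/\sqrt{t}$ decay is an assumed schedule chosen only to show what sublinear accumulation looks like.

```python
import math

EPSILON = 0.1  # hypothetical per-step error probability


def cumulative_expected_errors(error_prob, horizon):
    """Sum the per-step error probabilities for steps 1..horizon."""
    return sum(error_prob(t) for t in range(1, horizon + 1))


def constant_rate(t):
    """Fixed error probability at every step: the total grows linearly."""
    return EPSILON


def decaying_rate(t):
    """Assumed schedule where errors become rarer as t grows."""
    return min(1.0, EPSILON / math.sqrt(t))


for horizon in (10, 100, 1000):
    linear = cumulative_expected_errors(constant_rate, horizon)
    sublinear = cumulative_expected_errors(decaying_rate, horizon)
    # Per-step averages: constant in the linear case, shrinking otherwise.
    print(horizon, round(linear / horizon, 4), round(sublinear / horizon, 4))
```

Under the constant rate the average error per step stays at `EPSILON`, whereas under the decaying schedule the cumulative total is bounded by `2 * EPSILON * sqrt(horizon)`, so the per-step average tends to zero.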