Adaptive Test-Time Compute Allocation via Learned Heuristics over Categorical Structure

Shuhui Qu

TL;DR

The paper tackles verification-cost-limited reasoning by treating verifier calls as the scarce resource and allocating them at intermediate states rather than globally. It introduces a gated competition framework with (i) deterministic feasibility gates that prune structurally invalid moves, (ii) a hybrid scoring function combining a learned structural distance $D_{\text{type}}$ and a learned residual $r_\theta(w,m)$ to rank surviving moves, and (iii) a state-conditional verification budget $k(w)$ based on local uncertainty. A training recipe uses verifier-labeled candidate lists to learn $r_\theta$, optionally augmented by a remaining-steps trajectory signal to bias toward shorter successful traces. On the MATH benchmark, this approach achieves 55.2% accuracy with 44.8 verifier calls, outperforming best-of-$N$, majority voting, and beam search at the same budget, and demonstrating that fine-grained, state-aware allocation can yield substantial efficiency gains. The work highlights that reducing wasted verification and focusing checks where ambiguity is highest can meaningfully improve the accuracy–cost frontier, with limitations tied to proposal quality and the expressiveness of the structured move interface.

Abstract

Test-time computation has become a primary driver of progress in large language model (LLM) reasoning, but it is increasingly bottlenecked by expensive verification. In many reasoning systems, a large fraction of verifier calls are spent on redundant or unpromising intermediate hypotheses. We study reasoning under a verification-cost-limited setting and ask how verification effort should be allocated across intermediate states. We propose a state-level selective verification framework that combines (i) deterministic feasibility gating over a structured move interface, (ii) pre-verification ranking using a hybrid of learned state-distance and residual scoring, and (iii) adaptive allocation of verifier calls based on local uncertainty. Unlike solution-level best-of-$N$ or uniform intermediate verification, our method distributes verification where it is most informative. On the MATH benchmark, our approach achieves higher accuracy than best-of-$N$, majority voting, and beam search while using 44% fewer verifier calls.
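The three components named in the abstract can be illustrated with a minimal Python sketch for a single state $w$. It is not the authors' implementation. The function names, the additive score $-D_{\text{type}} + r_\theta$, and the entropy-based rule for $k(w)$ are all assumptions made for illustration, since this summary does not give the exact formulas.

```python
import math
from typing import Callable, List, Sequence, Tuple

def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over candidate scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs: Sequence[float]) -> float:
    """Entropy scaled to [0, 1]; 0 means one clear winner, 1 means uniform."""
    if len(probs) <= 1:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def select_and_verify(
    state: str,
    moves: Sequence[str],
    is_feasible: Callable[[str, str], bool],   # hypothetical deterministic gate
    d_type: Callable[[str, str], float],       # hypothetical structural distance
    residual: Callable[[str, str], float],     # hypothetical learned residual
    verify: Callable[[str, str], bool],        # the expensive verifier
    k_min: int = 1,
    k_max: int = 4,
) -> Tuple[List[str], int]:
    # (i) Feasibility gating: drop structurally invalid moves at no verifier cost.
    survivors = [m for m in moves if is_feasible(state, m)]
    if not survivors:
        return [], 0

    # (ii) Pre-verification ranking. ASSUMPTION: smaller distance is better,
    # so the score is the negated distance plus the residual.
    scores = {m: -d_type(state, m) + residual(state, m) for m in survivors}
    ranked = sorted(survivors, key=lambda m: scores[m], reverse=True)

    # (iii) Adaptive budget. ASSUMPTION: k(w) grows linearly with the
    # normalized entropy of the score distribution at this state.
    uncertainty = normalized_entropy(softmax([scores[m] for m in ranked]))
    k = min(len(ranked), k_min + round(uncertainty * (k_max - k_min)))

    # Spend verifier calls only on the top-k ranked survivors.
    accepted = [m for m in ranked[:k] if verify(state, m)]
    return accepted, k
```

When one move clearly dominates the ranking, this sketch spends a single verifier call. When the scores are nearly tied, it spends up to `k_max` calls, capped by the number of surviving moves, which mirrors the idea of directing verification to where local ambiguity is highest.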

Paper Structure (41 sections, 18 equations, 3 figures, 2 tables, 1 algorithm)


Introduction
Contributions.
Related Work
Test-time compute scaling for reasoning.
Verification, process supervision, and LLM judges.
Adaptive compute and budget allocation.
Structured generation, feasibility filtering, and constrained decoding.
Retrieval and reuse in reasoning.
Contextual bandits and resource allocation.
Problem Formulation: verification-cost-limited reasoning
State.
Moves and structured interface.
Verifier.
Goal and objective.
Method
...and 26 more sections

Figures (3)

Figure 1: Overall framework.
Figure 2: Budget-matched comparison across inference strategies. Accuracy on MATH-500 versus number of generations per problem $N$ (x-axis). We report Majority voting, solution-level Best-of-$N$ (weighted), Beam search ($b{=}4$), and our intermediate-state allocation method.
Figure 3: Backbone scaling for our framework on MATH-500. We plot accuracy versus the number of generations per problem (i.e., the verifier-call budget) for two backbones: Llama 3.2 1B and Llama 3.2 3B. Dashed horizontal lines indicate the 1-shot baselines of Llama 3.2 1B and Llama 3.1 70B for context.
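For readers unfamiliar with the solution-level baselines named in Figure 2, the following Python sketch shows one common way to aggregate $N$ sampled solutions. It assumes that "weighted" Best-of-$N$ means summing verifier scores over samples that share a final answer; the paper may define it differently, and the function names here are illustrative.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence

def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent final answer among the N samples."""
    return Counter(answers).most_common(1)[0][0]

def weighted_best_of_n(answers: Sequence[str], scores: Sequence[float]) -> str:
    """Return the answer whose samples have the largest total verifier score.

    ASSUMPTION: scores are per-solution verifier outputs, higher is better.
    """
    totals: Dict[str, float] = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)

# Five sampled solutions: "4" is more frequent, but "5" is scored higher.
sampled_answers = ["4", "5", "5", "4", "4"]
verifier_scores = [0.1, 0.9, 0.8, 0.2, 0.1]
print(majority_vote(sampled_answers))                        # 4
print(weighted_best_of_n(sampled_answers, verifier_scores))  # 5
```

Both baselines score only complete solutions, so every one of the $N$ generations consumes budget regardless of how early it went wrong. That is the contrast with intermediate-state allocation drawn in the figure.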
