HyPER: Bridging Exploration and Exploitation for Scalable LLM Reasoning with Hypothesis Path Expansion and Reduction

Shengxuan Qiu, Haochen Huang, Shuzhang Zhong, Pengfei Zuo, Meng Li

TL;DR

HyPER addresses the exploration–exploitation trade-off in test-time scaling for LLM reasoning by reframing it as a dynamic expand–reduce control problem over a pool of hypothesis paths. It introduces a training-free online controller, a token-level SingleToken refinement primitive, and a length- and confidence-aware voting mechanism that together reallocate compute adaptively under a fixed budget. Across four MoE models on diverse reasoning benchmarks, the approach improves accuracy by about 8–10 percentage points while reducing token usage by 25–40%, giving a better accuracy–compute trade-off. By narrowing the gap between a correct answer existing in the hypothesis pool and that answer being selected, HyPER offers a practical way to improve multi-path reasoning.

Abstract

Scaling test-time compute with multi-path chain-of-thought improves reasoning accuracy, but its effectiveness depends critically on the exploration-exploitation trade-off. Existing approaches address this trade-off in rigid ways: tree-structured search hard-codes exploration through brittle expansion rules that interfere with post-trained reasoning, while parallel reasoning over-explores redundant hypothesis paths and relies on weak answer selection. Motivated by the observation that the optimal balance is phase-dependent and that correct and incorrect reasoning paths often diverge only at late stages, we reformulate test-time scaling as a dynamic expand-reduce control problem over a pool of hypotheses. We propose HyPER, a training-free online control policy for multi-path decoding in mixture-of-experts models that reallocates computation under a fixed budget using lightweight path statistics. HyPER consists of an online controller that transitions from exploration to exploitation as the hypothesis pool evolves, a token-level refinement mechanism that enables efficient generation-time exploitation without full-path resampling, and a length- and confidence-aware aggregation strategy for reliable answer-time exploitation. Experiments on four mixture-of-experts language models across diverse reasoning benchmarks show that HyPER consistently achieves a superior accuracy-compute trade-off, improving accuracy by 8 to 10 percent while reducing token usage by 25 to 40 percent.

Paper Structure

This paper contains 67 sections, 4 theorems, 40 equations, 15 figures, 7 tables, 2 algorithms.

Key Result

Proposition 6.1

Fix a candidate answer set $\mathcal{A}'\subseteq\mathcal{A}$. Assume the proxy density ratio takes the log-linear form (eq:proxy-ratio) and that the weights are mild in the sense that $|\theta^\top \phi(r_i)|\le \epsilon$ for all $i\in[N]$ and some small $\epsilon$. Then, up to an additive answer-i... where $(\lambda_{\mathrm{len}},\lambda_{\mathrm{conf}})$ reparameterize $\theta$ after feature resc...
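
The "mild weights" assumption is what makes a first-order expansion possible. The following is only a sketch under stated assumptions, not the paper's derivation: it assumes the log-linear form in (eq:proxy-ratio), which is not reproduced on this page, is $w_i=\exp(\theta^\top\phi(r_i))$, and it introduces $w_i$, $R_i$, $a_i$ (the answer reached by path $r_i$), and $S(a)$ purely for illustration.

```latex
\begin{align*}
w_i &= \exp\!\big(\theta^\top \phi(r_i)\big)
     = 1 + \theta^\top \phi(r_i) + R_i,
\qquad |R_i| \le \tfrac{\epsilon^2}{2}\, e^{\epsilon},\\
S(a) &= \sum_{i\,:\,a_i = a} w_i
      \approx \sum_{i\,:\,a_i = a} \big(1 + \theta^\top \phi(r_i)\big),
\qquad a \in \mathcal{A}'.
\end{align*}
```

The remainder bound follows from Taylor's theorem applied to $e^x$ with $|x|\le\epsilon$. Under these assumptions, a weighted vote over $\mathcal{A}'$ reduces to first order to a plain frequency count plus a linear correction in the path features.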

Figures (15)

  • Figure 1: Both the existence probability and the accuracy show an increasing trend as the number of paths increases, yet there is a significant marginal benefit.
  • Figure 2: A representative reasoning example and tail-token confidence patterns of correct vs. incorrect paths.
  • Figure 3: Correct paths are present but outvoted by noisy paths under confidence-weighted voting: each path’s answer receives a weight given by the path’s global average token confidence.
  • Figure 4: Overview of HyPER.
  • Figure 5: Per-instance confidence–diversity scatter plots under isolated actions.
  • ...and 10 more figures
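
The confidence-weighted voting baseline described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration of that baseline only, not of HyPER's length- and confidence-aware voting. The function name, the `(answer, token_confidences)` input format, and the toy numbers are assumptions made for the example and do not come from the paper.

```python
from collections import defaultdict


def confidence_weighted_vote(paths):
    """Baseline voting as described in the Figure 3 caption.

    `paths` is a list of (answer, token_confidences) pairs (a hypothetical
    input format). Each path votes for its answer with a weight equal to
    the path's global average token confidence.
    """
    scores = defaultdict(float)
    for answer, token_confs in paths:
        if not token_confs:
            continue  # skip paths with no confidence information
        scores[answer] += sum(token_confs) / len(token_confs)
    if not scores:
        return None, {}
    winner = max(scores, key=scores.get)
    return winner, dict(scores)


# Toy data (hypothetical): one high-confidence path answering "42"
# and two lower-confidence paths that agree on "17".
paths = [
    ("42", [0.95, 0.90, 0.97]),
    ("17", [0.60, 0.55, 0.65]),
    ("17", [0.50, 0.60, 0.55]),
]
winner, scores = confidence_weighted_vote(paths)
print(winner, scores)
```

In this toy pool the single high-confidence path is outvoted because the two lower-confidence paths accumulate more total weight. If "42" were the correct answer, this would be the failure mode the caption describes.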

Theorems & Definitions (7)

  • Proposition 6.1: HyPER voting as linearized proxy-IS ranking
  • proof
  • Lemma 6.2: Relative error of frequency (uniform-weight marginalization)
  • proof
  • Corollary 6.3: Support lower bound implied by Top-$K$ truncation
  • Proposition 6.4: This penalty reduces expert collisions in the two-route toy
  • proof