Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing

Tong Zheng, Chengsong Huang, Runpeng Dai, Yun He, Rui Liu, Xin Ni, Huiwen Bao, Kaishen Wang, Hongtu Zhu, Jiaxin Huang, Furong Huang, Heng Huang

TL;DR

This work tackles the inefficiency of parallel thinking in large language models by introducing 2D probing, a diagnostic interface that reveals global width-depth dynamics across parallel reasoning branches. Guided by three key insights—non-monotonic width-depth scaling, heterogeneous branch lengths, and early stabilization of global consensus—the authors propose Parallel-Probe, a training-free online controller that uses consensus-based early stopping and deviation-based branch pruning to coordinate parallel generation. To enable principled evaluation of width-depth strategies, they introduce SCOUT, an offline testbed that decouples generation from policy evaluation. Across multiple model scales and hard benchmarks (AIME 2024/2025 and HMMT 2025), Parallel-Probe achieves a superior accuracy-efficiency Pareto frontier, reducing sequential tokens by up to 35.8% and total token cost by over 25.8% while maintaining competitive accuracy.

Abstract

Parallel thinking has emerged as a promising paradigm for reasoning, yet it imposes significant computational burdens. Existing efficiency methods primarily rely on local, per-trajectory signals and lack principled mechanisms to exploit global dynamics across parallel branches. We introduce 2D probing, an interface that exposes the width-depth dynamics of parallel thinking by periodically eliciting intermediate answers from all branches. Our analysis reveals three key insights: non-monotonic scaling across width-depth allocations, heterogeneous reasoning branch lengths, and early stabilization of global consensus. Guided by these insights, we introduce Parallel-Probe, a training-free controller designed to optimize online parallel thinking. Parallel-Probe employs consensus-based early stopping to regulate reasoning depth and deviation-based branch pruning to dynamically adjust width. Extensive experiments across three benchmarks and multiple models demonstrate that Parallel-Probe establishes a superior Pareto frontier for test-time scaling. Compared to standard majority voting, it reduces sequential tokens by up to 35.8% and total token cost by over 25.8% while maintaining competitive accuracy.
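To make the two mechanisms named in the abstract concrete, the following is a minimal offline sketch, not the authors' implementation. It assumes the 2D probe is available as a matrix `probes[t][b]` (the intermediate answer of branch `b` at probing step `t`, or `None` if no answer could be elicited). The function name, the stability threshold `stable_steps`, and the exact semantics given to `prune_patience` and `warmup` are illustrative assumptions; the abstract does not specify the stopping or pruning criteria in detail.

```python
from collections import Counter


def majority(answers):
    """Return the most common non-None answer, or None if there is none."""
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None


def parallel_probe_sketch(probes, stable_steps=3, prune_patience=2, warmup=1):
    """Replay a probe matrix through a hypothetical width-depth controller.

    probes[t][b] is branch b's intermediate answer at probing step t.
    Returns (final_answer, stop_step, pruned), where pruned maps a branch
    index to the probing step at which it was dropped.
    All thresholds are assumptions made for illustration.
    """
    active = set(range(len(probes[0])))
    deviations = [0] * len(probes[0])
    pruned = {}
    last_consensus, streak = None, 0

    for t, row in enumerate(probes):
        consensus = majority([row[b] for b in sorted(active)])

        # Consensus-based early stopping (depth control): halt every branch
        # once the majority answer has been unchanged for stable_steps probes.
        if consensus is not None and consensus == last_consensus:
            streak += 1
        else:
            streak = 1 if consensus is not None else 0
        last_consensus = consensus
        if streak >= stable_steps:
            return consensus, t, pruned

        # Deviation-based pruning (width control): after a warm-up period,
        # drop branches that disagree with the majority for prune_patience
        # consecutive probes. Branches with no answer yet are left alone.
        if t >= warmup and consensus is not None:
            for b in list(active):
                if row[b] is None:
                    continue
                deviations[b] = deviations[b] + 1 if row[b] != consensus else 0
                if deviations[b] >= prune_patience and len(active) > 1:
                    active.discard(b)
                    pruned[b] = t

    return last_consensus, len(probes) - 1, pruned


if __name__ == "__main__":
    toy_probes = [
        ["7", "3", "7", "9"],
        ["7", "7", "7", "9"],
        ["7", "7", "7", "9"],
        ["7", "7", "7", "9"],
        ["7", "7", "7", "7"],
    ]
    answer, stop_step, pruned = parallel_probe_sketch(toy_probes, stable_steps=4)
    print(answer, stop_step, pruned)  # prints: 7 3 {3: 2}
```

In this toy run, branch 3 is pruned at probing step 2 after disagreeing with the majority twice in a row, and the whole ensemble stops at step 3, so the final probing step is never generated. Replaying a stored probe matrix in this way is similar in spirit to decoupling generation from policy evaluation, although the sketch makes no claim to match SCOUT's actual interface.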

Paper Structure (41 sections, 3 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Overview of the Parallel-Probe framework. It monitors $N$ parallel reasoning branches via continuous 2D probing. (1) Divergence Pruning: Outlying trajectories that drift from the global majority (e.g., Branch 4) are aggressively pruned to save compute. (2) Stability Stopping: The global controller halts the entire ensemble once the consensus stabilizes, preventing the execution of redundant post-convergence steps (dashed area). Crucially, Parallel-Probe is model-agnostic and compatible with various off-the-shelf LLMs. We evaluate Performance, Cost Efficiency, and Latency Efficiency across 0.6B and 1.7B models. Values are averaged across all datasets and normalized such that the best-performing method on each axis equals 1.0. Parallel-Probe (blue) achieves the largest coverage area, demonstrating a superior balance between high accuracy and computational efficiency compared to SC and ESC methods.
  • Figure 2: Analysis of Model Performance and Dynamics. Detailed experimental setups and additional examples for subfigures (a), (b), and (c) are provided in the appendix. (a) AIME24 performance of Qwen3-0.6B across varying branch numbers and lengths. Accuracy is measured via majority voting. Red lines indicate fixed total token budgets (branch length $\times$ number of branches), ranging from $32\mathrm{K}$ to $256\mathrm{K}$. (b) Answer convergence behavior for a representative AIME25 question using Qwen3-4B across different probing steps. Red denotes the group corresponding to the correct answer at each step, while other colors represent distinct incorrect answer groups. (c) Convergence patterns across different models and datasets. We report the convergence onset ratio, defined as the probing step at which the final majority answer first becomes the consensus, divided by the maximum branch length.
  • Figure 3: Accuracy-token scaling curves comparing SC, SC+SAC, and Parallel-Probe across different models and benchmarks. Results for SC+SAC are shown under three settings ($n=14$, $n=16$, $n=18$). The x-axis is in log scale. Parallel-Probe consistently achieves higher accuracy under the same or a lower token budget.
  • Figure 4: Hyper-parameter sensitivity analysis of Parallel-Probe under different prune patience $k$ and warm-up steps $W$ on Qwen3-0.6B and Qwen3-1.7B across AIME24 and AIME25.
  • Figure 5: Coverage density across varying branch counts and lengths (Qwen3-0.6B, AIME25). Colors indicate the volume of questions with available majority-voting results. The red box highlights the high-coverage region used to mitigate bias from uneven response lengths during accuracy estimation.
  • ...and 2 more figures
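
The convergence onset ratio described in the Figure 2(c) caption can be illustrated with a short sketch. It assumes the same hypothetical probe matrix layout, `probes[t][b]`, with uniformly spaced probing steps so that the ratio can be computed in steps rather than tokens, and it reads "becomes consensus" as "is the most common answer at that step". Both readings are assumptions; the caption does not pin them down.

```python
from collections import Counter


def plurality(answers):
    """Return the most common non-None answer, or None if there is none."""
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None


def convergence_onset_ratio(probes):
    """Fraction of the maximum depth at which the final majority first appears.

    probes[t][b] is branch b's answer at probing step t (None if absent).
    Assumes uniformly spaced probes, so (t + 1) / len(probes) stands in for
    the probing position divided by the maximum branch length.
    """
    final = plurality(probes[-1])
    if final is None:
        return None
    for t, row in enumerate(probes):
        if plurality(row) == final:
            return (t + 1) / len(probes)
    return None  # not reached: the last row always matches itself


if __name__ == "__main__":
    toy_probes = [
        ["3", "3", "7"],
        ["3", "7", "7"],
        ["7", "7", "7"],
        ["7", "7", "7"],
    ]
    print(convergence_onset_ratio(toy_probes))  # prints: 0.5
```

A ratio well below 1.0 means the final majority answer was already in place long before the longest branch finished, which is the kind of early stabilization the caption refers to. A stricter variant could require that the majority never changes again after the onset step.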