
Adaptive Parallel Monte Carlo Tree Search for Efficient Test-time Compute Scaling

Hongbeen Kim, Juhyun Lee, Sanghyeon Lee, Kwanghoon Choi, Jaehyuk Huh

Abstract

Monte Carlo Tree Search (MCTS) is an effective test-time compute scaling (TTCS) method for improving the reasoning performance of large language models, but its highly variable execution time leads to severe long-tail latency in practice. Existing optimizations, such as positive early exit, reduce latency in favorable cases but are less effective when search continues without meaningful progress. We introduce {\it negative early exit}, which prunes unproductive MCTS trajectories, and an {\it adaptive boosting mechanism} that reallocates the reclaimed computation to reduce resource contention among concurrent searches. Integrated into vLLM, these techniques substantially reduce p99 end-to-end latency while improving throughput and maintaining reasoning accuracy.
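The two mechanisms in the abstract can be illustrated with a minimal sketch. The function names, the sliding-window stopping rule, and the per-search budget dictionaries below are illustrative assumptions, not the paper's implementation; only the acceptance threshold $\tau$ (e.g. $\tau=0.3$, as in Figure 5) comes from the source.

```python
# Hypothetical sketch of negative early exit and adaptive boosting for a
# batch of concurrent MCTS searches. All names and data structures here are
# assumptions for illustration; the paper integrates these ideas into vLLM.

def negative_early_exit(rollout_rewards, tau=0.3, window=4):
    """Prune a search when its recent rollouts show no meaningful progress:
    the best reward among the last `window` rollouts stays below the
    acceptance threshold `tau`."""
    if len(rollout_rewards) < window:
        return False  # not enough evidence to prune yet
    return max(rollout_rewards[-window:]) < tau

def reallocate_budget(searches, freed_rollouts):
    """Adaptive boosting: split the rollout budget reclaimed from pruned
    searches evenly among the searches that are still active, easing
    contention among concurrent tree searches."""
    active = [s for s in searches if not s["pruned"]]
    if not active:
        return
    bonus = freed_rollouts // len(active)
    for s in active:
        s["budget"] += bonus
```

For example, a search whose last four rollout rewards are all below 0.3 would be pruned, and its remaining rollout budget handed to the still-active searches, which is what shortens the long tail of the latency distribution.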

Paper Structure

This paper contains 26 sections, 2 equations, 10 figures, 1 table, 1 algorithm.

Figures (10)

  • Figure 1: Two categories of the TTCS method: (a) sequential search and (b) parallel search. t denotes the time.
  • Figure 2: Comparison of search efficiency between Beam Search and MCTS, and the token reduction enabled by early exit, with MCTS using 12 rollouts and Beam Search using 8 beams.
  • Figure 3: End-to-end latency distribution of sequential MCTS (a) with and (b) without early exit. Maximum number of rollouts is 32.
  • Figure 4: Impact of parallelism on latency and accuracy. (a) Average latency per tree search decreases as parallelism increases. (b) Higher parallelism leads to minor fluctuations in accuracy. Results are reported on a subset of challenging problems that require generating a large number of tokens.
  • Figure 5: Overview of the negative early exit mechanism with an acceptance threshold of $\tau=0.3$.
  • ...and 5 more figures