Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls

Ante Wang, Linfeng Song, Ye Tian, Dian Yu, Haitao Mi, Xiangyu Duan, Zhaopeng Tu, Jinsong Su, Dong Yu

TL;DR

This work tackles inefficiencies in LLM reasoning that arise when tree search is guided by verifiers, identifying two causes: over-exploration from redundant states and under-exploration from high variance in verifier scores. It introduces FETCH, a plug-in framework that combines semantic state merging via agglomerative clustering of embeddings (post-processed with signals from prompting or consistency checks) with two variance reduction techniques: TD($\lambda$) training for verifiers and ensemble scoring at inference. Empirically, FETCH reduces token costs and boosts accuracy across Best-First Search (BFS), Beam Search, Tree Search, and MCTS on the GSM8K, GSM-Plus, and MATH datasets, with state merging cutting costs by up to ~3x in some cases and variance reduction providing consistent 1–2 point gains. The results suggest that FETCH can make verifier-guided tree search more efficient and scalable, and therefore more practical to deploy in complex domains.

Abstract

Recent advancements in tree search algorithms guided by verifiers have significantly enhanced the reasoning capabilities of large language models (LLMs), but at the cost of increased computational resources. In this work, we identify two key challenges contributing to this inefficiency: $\textit{over-exploration}$ due to redundant states with semantically equivalent content, and $\textit{under-exploration}$ caused by high variance in verifier scoring leading to frequent trajectory switching. To address these issues, we propose FETCH, an e$\textbf{f}$fici$\textbf{e}$nt $\textbf{t}$ree sear$\textbf{ch}$ framework, which is a flexible, plug-and-play system compatible with various tree search algorithms. Our framework mitigates over-exploration by merging semantically similar states using agglomerative clustering of text embeddings obtained from a fine-tuned SimCSE model. To tackle under-exploration, we enhance verifiers by incorporating temporal difference learning with adjusted $\lambda$-returns during training to reduce variance, and employing a verifier ensemble to aggregate scores during inference. Experiments on GSM8K, GSM-Plus, and MATH datasets demonstrate that our methods significantly improve reasoning accuracy and computational efficiency across four different tree search algorithms, paving the way for more practical applications of LLM-based reasoning. The code is available at https://github.com/Soistesimmer/Fetch.
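
The two variance-reduction ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the $\lambda$-returns are "adjusted", so the sketch assumes the textbook $\lambda$-return with no discounting and a single outcome reward at the end of the trajectory. The function names `lambda_return_targets` and `ensemble_score`, and the choice of a plain mean for aggregation, are hypothetical.

```python
from statistics import mean


def lambda_return_targets(values, outcome, lam=0.9):
    """Compute lambda-return training targets for one reasoning trajectory.

    values[t] is the current verifier estimate for the state after step t.
    outcome is the final reward (e.g. 1.0 if the answer is correct, else 0.0).
    Assumptions (not from the paper): discount factor 1 and zero reward on
    all intermediate steps, so the recursion is
        G[t] = (1 - lam) * values[t + 1] + lam * G[t + 1],  G[T - 1] = outcome.
    lam = 1 recovers the Monte Carlo target; lam = 0 recovers one-step TD.
    """
    targets = [0.0] * len(values)
    for t in reversed(range(len(values))):
        if t == len(values) - 1:
            # The last state is followed directly by the outcome reward.
            targets[t] = outcome
        else:
            targets[t] = (1 - lam) * values[t + 1] + lam * targets[t + 1]
    return targets


def ensemble_score(verifier_scores):
    """Aggregate the scores several verifiers give to the same state.

    A plain mean is an assumption; averaging independent noisy scores
    lowers the variance of the aggregated score.
    """
    return mean(verifier_scores)


# Toy usage with made-up numbers.
targets = lambda_return_targets([0.2, 0.5, 0.8], outcome=1.0, lam=0.5)
score = ensemble_score([0.6, 0.8, 0.7])
```

Blending bootstrapped estimates with the final outcome trades some bias for lower target variance than a pure outcome label, which is the motivation the abstract gives for temporal difference training.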

Paper Structure

This paper contains 38 sections, 7 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: The effect of Fetch on GSM8K problems of varying difficulty, with Best-First Search as the baseline. Panel (a) shows that Fetch reduces computational cost and shifts resources toward harder tasks (Levels 3–4) rather than simpler ones (Levels 1–2), addressing both over-exploration on easy problems and under-exploration on hard ones. Panel (b) shows the resulting improvement in performance.
  • Figure 2: Pilot experiment results using BFS on GSM8K. (a, b) Performance and cost for different expansion sizes $N$, where the average number of generated tokens, #Token ($k$), is used to estimate computational cost. (c) The average similarity among the $N$ sub-nodes of 1000 randomly selected non-leaf nodes, which affects the severity of over-exploration. (d) Standard deviation of verifier scores over 10 sampled correct trajectories per question at different difficulty levels. High-variance scores can lead to under-exploration of promising nodes.
  • Figure 3: Illustration of redundant state merging. When new nodes are expanded, we merge semantically equivalent nodes into hyper-nodes using agglomerative clustering based on their embeddings.
  • Figure 4: Ablation studies on parameter selection of $f$ and $d$ for state merging.
  • Figure 5: Inference-time scaling for the BFS algorithm equipped with our methods. The expansion budget $N$ is set to $2$, $3$, $5$, and $10$.
  • ...and 3 more figures
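
The redundant state merging described in the Figure 3 caption can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the released code: the toy 2-D vectors stand in for embeddings from the fine-tuned SimCSE model, and the function name `merge_states`, the average-linkage rule, the cosine distance, and the `distance_threshold` default are all hypothetical choices.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def merge_states(embeddings, distance_threshold=0.15):
    """Group newly expanded nodes into hyper-nodes.

    Bottom-up (agglomerative) clustering: start with one cluster per node and
    repeatedly merge the closest pair until the closest pair is farther apart
    than distance_threshold. Returns a list of clusters, each a sorted list
    of node indices.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def average_linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        total = sum(
            cosine_distance(embeddings[i], embeddings[j]) for i in a for j in b
        )
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters (i < j by construction).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if average_linkage(clusters[i], clusters[j]) > distance_threshold:
            break
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters


# Toy usage: nodes 0 and 1 are near-duplicates, as are nodes 2 and 3.
toy_embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
hyper_nodes = merge_states(toy_embeddings)  # [[0, 1], [2, 3]]
```

In a search loop, each returned cluster would be expanded once as a single hyper-node rather than once per member, which is how merging reduces the number of generated tokens.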