Chain-in-Tree: Back to Sequential Reasoning in LLM Tree Search

Xinzhe Li

TL;DR

Chain-in-Tree (CiT) introduces a chaining phase in LLM-in-the-loop tree search to avoid unnecessary branching by evaluating Branching Necessity (BN) before expanding nodes. It presents two BN evaluators, BN-DP and BN-SC, with BN-DP provably not increasing policy invocations, and demonstrates CiT’s compatibility across ToT-BS, ReST-MCTS, and RAP on GSM8K and Math500, achieving 75–85% reductions in runtime and token usage with negligible accuracy loss. BN-SC offers substantial savings in many settings but can be unstable in a subset of cases, highlighting the importance of BN evaluator quality. The authors provide theoretical efficiency guarantees, release modular, cross-framework CiT code, and discuss reproducibility and limitations, underscoring CiT’s practical impact for scaling LLM-based search in mathematical and planning tasks.

Abstract

Test-time scaling improves large language models (LLMs) on long-horizon reasoning tasks by allocating more compute at inference. LLM Inference via Tree Search (LITS) methods achieve strong performance but are highly inefficient, often running an order of magnitude slower than iterative approaches. We propose Chain-in-Tree (CiT), a plug-in framework that decides when to branch during search rather than expanding at every step. CiT introduces lightweight Branching Necessity (BN) evaluations: BN-DP (Direct Prompting), where an auxiliary LLM judges whether branching is needed, and BN-SC (Self-Consistency), which clusters candidate actions to assess agreement. Integrated into Tree of Thoughts, ReST-MCTS, and RAP, BN-DP achieves 75-85% reductions in token generation, model calls, and runtime on GSM8K and Math500, often with negligible or no accuracy loss. BN-SC typically yields substantial savings (up to 80%) but shows instability in 1-4 out of 14 settings, caused by a small subset of examples that produce extremely long reasoning steps. We theoretically prove that BN-DP never increases policy invocations and release both modular LITS implementations and a lightweight CiT function applicable across all LITS variants. The full codebase is publicly available at https://github.com/xinzhel/chain_in_tree.
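To make the chain-or-branch decision concrete, the following is a minimal sketch of a BN-SC-style check, not the released implementation. It assumes that candidate actions are plain strings, that "clustering" can be approximated by grouping on a normalized string, and that agreement is measured as the share of the largest cluster. The names `normalize`, `bn_sc_needs_branching`, `expand`, and the `threshold`, `n_judge`, and `n_branch` parameters are all hypothetical.

```python
from collections import Counter
from typing import Callable, List


def normalize(action: str) -> str:
    """Hypothetical clustering key: lowercase and collapse whitespace."""
    return " ".join(action.lower().split())


def bn_sc_needs_branching(candidates: List[str], threshold: float = 0.5) -> bool:
    """Return True when the sampled candidates disagree enough to justify branching.

    Assumption: agreement is the share of the largest cluster of candidates.
    """
    if not candidates:
        raise ValueError("need at least one candidate action")
    clusters = Counter(normalize(c) for c in candidates)
    top_share = clusters.most_common(1)[0][1] / len(candidates)
    return top_share < threshold


def expand(
    state: str,
    sample_actions: Callable[[str, int], List[str]],
    n_branch: int = 3,
    n_judge: int = 2,
    threshold: float = 0.5,
) -> List[str]:
    """Return one child when chaining, or n_branch children when branching.

    sample_actions(state, n) stands in for the policy LLM (an assumption).
    """
    judged = sample_actions(state, n_judge)
    if not bn_sc_needs_branching(judged, threshold):
        # Chain: keep a single representative of the majority cluster.
        majority = Counter(normalize(a) for a in judged).most_common(1)[0][0]
        return [next(a for a in judged if normalize(a) == majority)]
    # Branch: fall back to the full expansion of the underlying search method.
    return sample_actions(state, n_branch)
```

In this sketch, a step on which the sampled actions agree costs only the `n_judge` samples and yields a single child, which is where the savings in the chaining phase would come from. Whether the judge samples are reused during branching is a design choice this sketch does not take from the paper.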


Paper Structure

This paper contains 55 sections, 14 equations, 7 figures, 15 tables.

Figures (7)

  • Figure 1: Policy invocations in Original ToT-BS vs. ToT-BS + Chaining (BN-SC). Numbers with $\prime$ indicate beam search sampling size; those with $*$ indicate BN judge sampling size. The BN-DP variant is shown in the appendix on theoretical costs.
  • Figure 2: Accuracy comparison under CiT plug-in across two search frameworks (ReST-MCTS* and ToT-BS) and two datasets (GSM8K, Math500). Bars show BN evaluator quality (Poor BN: LLaMA-3-8B vs Accurate BN: Qwen-3-32B). Horizontal lines denote baselines (CoT, ReST, ToT-BS).
  • Figure 3: The number of policy invocations of Original ToT-BS vs. ToT-BS + Chaining. Numbers suffixed by "$\prime$" indicate the sampling size for beam search expansion, while numbers suffixed by "$*$" indicate the sampling size for the BN judge.
  • Figure 4: Instance-level analysis of a failure case on Math500 with LLaMA+Qwen (policy = LLaMA, BN = Qwen). Each point shows the relative change in the number of invocations compared to the baseline; negative values indicate higher efficiency (fewer output tokens required). Filled markers correspond to correct predictions, while hollow markers correspond to incorrect ones.
  • Figure 5: Instance-level analysis of failure for Math500 with LLaMA+LLaMA (BN-SC$^1$).
  • ...and 2 more figures