Caterpillar of Thoughts: The Optimal Test-Time Algorithm for Large Language Models

Amir Azarmehr, Soheil Behnezhad, Alma Ghafari

Abstract

Large language models (LLMs) can often produce substantially better outputs when allowed to use additional test-time computation, such as sampling, chain of thought, backtracking, or revising partial solutions. Despite the growing empirical success of such techniques, there is limited theoretical understanding of how inference-time computation should be structured, or of what constitutes an optimal use of a fixed computation budget. We model test-time computation as an algorithm interacting with a Markov chain: at any point, the algorithm may resume generation from any previously observed state. That is, unlike standard Markov chains, where the states are drawn passively, we allow the algorithm to backtrack to any previously observed state of the Markov chain at any time. Many existing test-time algorithms, such as Chain-of-Thought (CoT) (Wei et al., 2023), Tree-of-Thoughts (ToT) (Yao et al., 2023), or Best-of-$k$ (Brown et al., 2024), can be seen as specific algorithms in this model. We prove that while backtracking can reduce the number of generations exponentially, a very limited form of backtracking is theoretically sufficient. Namely, we show that the optimal algorithm always generates a caterpillar tree: if we remove the leaves of the state tree generated by the optimal algorithm, we obtain a path. Motivated by our characterization of the optimal algorithm, we present Caterpillar of Thoughts (CaT), a new test-time computation algorithm that reduces the number of token/state generations. Our empirical evaluation shows that CaT, compared to ToT, achieves a better success rate while also reducing the number of token generations.
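
The caterpillar property described in the abstract (removing the leaves of the state tree leaves a path) can be checked mechanically. The following is a minimal sketch, not taken from the paper: the function name `is_caterpillar` and the edge-list input are hypothetical, the input is assumed to already be a tree, and it is treated as undirected, so a root with a single child counts as a leaf.

```python
from collections import defaultdict


def is_caterpillar(edges):
    """Return True if the tree given by `edges` is a caterpillar.

    Assumption: `edges` is a list of (u, v) pairs forming an undirected tree.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Leaves are the nodes of degree 1; the "spine" is everything else.
    leaves = {node for node, nbrs in adj.items() if len(nbrs) == 1}
    spine = set(adj) - leaves
    if not spine:
        return True  # empty tree or a single edge: trivially a caterpillar

    # Removing the leaves of a tree leaves a connected subtree, so the
    # spine is a path exactly when no spine node has more than two
    # neighbours that are also on the spine.
    return all(len(adj[node] & spine) <= 2 for node in spine)
```

For example, a path `0-1-2-3` with extra leaves hanging off nodes 1 and 2 passes the check, whereas a "spider" with three legs of length two does not, because its center keeps three neighbours after the leaves are removed.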

Paper Structure

This paper contains 21 sections, 23 theorems, 37 equations, 3 figures, 2 tables, 4 algorithms.

Key Result

Theorem 1.1

Given an initial state $x_0$, the algorithm labeled alg:opt in the paper reaches the target state using the optimal number of steps in expectation, i.e., in $\mathsf{OPT}(x_0)$ steps.

Figures (3)

  • Figure 1: A Markov chain where rewinding is crucial, consisting of $n+1$ states $x_0 \rightarrow x_1 \rightarrow \ldots \rightarrow x_{n}$ on a path, and one dummy state $D$. The arrows denote the transition probabilities. The first state on the path, $x_0$, is the starting point of the algorithm, and the last state, $x_{n}$, is the target state.
  • Figure 2: The trees explored by two different algorithms when the input is the Markov chain of Figure 1. White circles represent state $D$, and green states represent the optimal path to the target state. The tree on the right is the one explored by CaT; the tree on the left is the one explored by a BFS-like algorithm that expands without pruning the dummy state.
  • Figure 3: The log-linear relation between the number of valid expressions and the number of steps.
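
To make the role of rewinding in the Figure 1 chain concrete, here is a small simulation sketch. The transition probabilities of the figure are not reproduced in this summary, so the code assumes (hypothetically) that each generation from $x_i$ advances to $x_{i+1}$ with probability `p` and otherwise lands on the dummy state $D$, from which the target cannot be reached. The function names are illustrative and do not come from the paper.

```python
import random


def steps_with_rewind(n, p, rng):
    """Count generations when resuming from the furthest path state seen."""
    steps, pos = 0, 0
    while pos < n:
        steps += 1            # one generation from state x_pos
        if rng.random() < p:  # assumed: advance with probability p
            pos += 1
        # otherwise the sample landed on D; rewind to x_pos and retry
    return steps


def steps_with_restart(n, p, rng):
    """Count generations when every failure restarts from x_0."""
    steps, pos = 0, 0
    while pos < n:
        steps += 1
        if rng.random() < p:
            pos += 1
        else:
            pos = 0  # landed on D; start a fresh attempt from x_0
    return steps


def mean_steps(strategy, n, p, trials=2000, seed=0):
    """Average number of generations over independent simulated runs."""
    rng = random.Random(seed)
    return sum(strategy(n, p, rng) for _ in range(trials)) / trials
```

Under these assumed probabilities, the rewinding strategy needs $n/p$ generations in expectation, while restarting from $x_0$ needs $(p^{-n} - 1)/(1 - p)$, which grows exponentially in $n$. With `n = 6` and `p = 0.5`, this is 12 versus 126 expected generations, and `mean_steps(steps_with_rewind, 6, 0.5)` and `mean_steps(steps_with_restart, 6, 0.5)` should land close to those values.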

Theorems & Definitions (42)

  • Theorem 1.1
  • Definition 2.1: Rewinding Algorithms
  • Definition 2.2: The Optimal Algorithm
  • Lemma 3.0
  • Definition 3.1: Non-branching Algorithms
  • Theorem 3.2
  • proof
  • Corollary 3.3
  • proof
  • Theorem 3.3
  • ...and 32 more