STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts

Zachary Bamberger, Till R. Saenger, Gilad Morad, Ofra Amir, Brandon M. Stewart, Amir Feder

TL;DR

STATe addresses the need for controllable and interpretable reasoning in inference-time computation for large language models by introducing discrete action templates in a Tree-of-Thoughts framework. The approach uses a Plan→Generate→Evaluate→Select loop with a controller, generator, and evaluator to produce diverse yet high-quality outputs, and demonstrates benefits on NoveltyBench and an argument-generation case study. The key contributions are (1) a controllable action-space search that improves diversity without sacrificing quality, (2) an attribution framework linking action traces to output quality to identify promising strategies, and (3) a method to steer generation toward unexplored but high-potential regions via targeted trajectory exploration. The work provides a practical, interpretable framework for designing diverse, high-quality, and explainable text generation, with publicly available code.

Abstract

Inference-Time-Compute (ITC) methods like Best-of-N and Tree-of-Thoughts are meant to produce output candidates that are both high-quality and diverse, but their use of high-temperature sampling often fails to achieve meaningful output diversity. Moreover, existing ITC methods offer limited control over how to perform reasoning, which in turn limits their explainability. We present STATe-of-Thoughts (STATe), an interpretable ITC method that searches over high-level reasoning patterns. STATe replaces stochastic sampling with discrete and interpretable textual interventions: a controller selects actions encoding high-level reasoning choices, a generator produces reasoning steps conditioned on those choices, and an evaluator scores candidates to guide search. This structured approach yields three main advantages. First, action-guided textual interventions produce greater response diversity than temperature-based sampling. Second, in a case study on argument generation, STATe's explicit action sequences capture interpretable features that are highly predictive of output quality. Third, estimating the association between performance and action choices allows us to identify promising yet unexplored regions of the action space and steer generation directly toward them. Together, these results establish STATe as a practical framework for generating high-quality, diverse, and interpretable text. Our framework is available at https://github.com/zbambergerNLP/state-of-thoughts.
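
The controller, generator, and evaluator roles described above can be illustrated with a minimal beam-search sketch. This is not the released implementation: the names `State`, `state_search`, `toy_generate`, and `toy_evaluate` are hypothetical, the "controller" here simply proposes every action at every step, and the generator and evaluator are toy stand-ins for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class State:
    """One search node: the action trace so far, the text it produced, and its score."""
    actions: Tuple[str, ...] = ()
    text: str = ""
    score: float = 0.0


def state_search(
    actions: Sequence[str],
    generate: Callable[[str, str], str],
    evaluate: Callable[[str], float],
    depth: int,
    beam_width: int,
) -> List[State]:
    """Toy Plan -> Generate -> Evaluate -> Select loop (assumed structure, not the paper's code)."""
    beam = [State()]
    for _ in range(depth):
        candidates = []
        for state in beam:
            # Plan: this toy controller proposes every available action.
            for action in actions:
                # Generate: extend the trajectory conditioned on the chosen action.
                text = generate(state.text, action)
                # Evaluate: score the extended candidate.
                score = evaluate(text)
                candidates.append(State(state.actions + (action,), text, score))
        # Select: keep the top-k candidates (sort is stable, so ties keep proposal order).
        candidates.sort(key=lambda s: s.score, reverse=True)
        beam = candidates[:beam_width]
    return beam


# Hypothetical action templates and toy stand-ins for the LLM components.
ACTIONS = ["claim", "evidence", "rebuttal"]


def toy_generate(prefix: str, action: str) -> str:
    # A real generator would prompt an LLM with the action template.
    return (prefix + " " if prefix else "") + f"[{action}]"


def toy_evaluate(text: str) -> float:
    # A real evaluator would score quality; this one rewards distinct action tags.
    return float(len(set(text.split())))


final_beam = state_search(ACTIONS, toy_generate, toy_evaluate, depth=2, beam_width=2)
for state in final_beam:
    print(state.actions, state.score)
```

Because every surviving state carries its full action trace, each output can later be related to the discrete choices that produced it, which is the property the attribution analysis relies on.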

Paper Structure

This paper contains 71 sections, 14 equations, 12 figures, 5 tables, and 1 algorithm.

Figures (12)

  • Figure 1: STATe for argument generation. Given the task of generating persuasive arguments in favor of banning single-use plastics, STATe's workflow involves the following steps: (1) Define action templates that control output features of interest, such as structural prefixes and content themes. (2) Generate outputs via tree search (grey nodes indicate pruned branches; the rightmost path illustrates early stopping after a single step). (3) Evaluate outputs on a downstream metric, and study associations between action choices and performance.
  • Figure 2: Stylized example of STATe's Plan$\rightarrow$Generate$\rightarrow$Evaluate$\rightarrow$Select loop. The controller plans which actions to explore, the generator expands candidate trajectories, the evaluator scores them, and beam selection retains the top-$k$ states.
  • Figure 3: Predictability of argument quality from controller actions. (A) Cross-validation R$^2$ across model variants: M1a (structure presence only), M1b (content presence only), M1c (both), and M2 (full sequential model with position effects and transitions). (B) Held-out test set performance with 95% bootstrap confidence intervals. The sequential model consistently outperforms presence-based baselines, with the largest absolute predictability under strict synthesis.
  • Figure 4: Predictability of argument quality from controller actions and argument length. (A) Cross-validation R$^2$ across model variants: M0 (argument length), M1a (structure presence only + argument length), M1b (content presence only + argument length), M1c (both + argument length), and M2 (full sequential model with position effects and transitions + argument length). (B) Held-out test set performance with 95% bootstrap confidence intervals. The sequential model consistently outperforms presence-based baselines, with the largest absolute predictability under strict synthesis.
  • Figure 5: Distribution of argument lengths (characters) across synthesis modes. Strict and restructured synthesis produce relatively unimodal distributions, while faithful synthesis exhibits higher variance with a bimodal tendency: some arguments remain compact while others include additional concluding material. We removed exact duplicates, reducing the number of samples to N=4,994 for strict synthesis.
  • ...and 7 more figures
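
The idea of relating action choices to output quality, and then steering toward unexplored action sequences, can be sketched with a deliberately simple presence-based estimate. This is an assumption-laden simplification, not the regression models (M1a to M2) named in the captions: `presence_effects` and `rank_unexplored` are hypothetical helpers, the effect of an action is taken to be a plain difference in mean scores, and the traces and scores below are invented toy values rather than results from the paper.

```python
from itertools import product
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Trace = Tuple[Tuple[str, ...], float]  # (action sequence, quality score)


def presence_effects(traces: Sequence[Trace], actions: Sequence[str]) -> Dict[str, float]:
    """Difference in mean score between traces that contain an action and those that do not."""
    effects = {}
    for action in actions:
        with_a = [score for seq, score in traces if action in seq]
        without_a = [score for seq, score in traces if action not in seq]
        # Without both groups there is nothing to compare, so fall back to zero.
        effects[action] = mean(with_a) - mean(without_a) if with_a and without_a else 0.0
    return effects


def rank_unexplored(
    traces: Sequence[Trace], actions: Sequence[str], length: int
) -> List[Tuple[str, ...]]:
    """Rank action sequences never tried so far by the summed effects of the actions present."""
    effects = presence_effects(traces, actions)
    seen = {seq for seq, _ in traces}
    unexplored = [seq for seq in product(actions, repeat=length) if seq not in seen]
    # Presence-based: order within a sequence is ignored, each action counts once.
    return sorted(
        unexplored,
        key=lambda seq: sum(effects[a] for a in set(seq)),
        reverse=True,
    )


# Toy action vocabulary and invented scores, for illustration only.
ACTIONS = ["claim", "evidence", "rebuttal"]
TRACES = [
    (("claim", "claim"), 1.0),
    (("claim", "evidence"), 3.0),
    (("evidence", "evidence"), 3.0),
    (("claim", "rebuttal"), 2.0),
]

effects = presence_effects(TRACES, ACTIONS)
ranked = rank_unexplored(TRACES, ACTIONS, length=2)
print(effects)
print(ranked[0])
```

A presence-only estimate ignores position and transitions between actions, which the captions above suggest carry additional predictive signal; a sequential model would replace the summed effects with position-aware and transition-aware terms.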