Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models

Sijia Chen; Baochun Li; Di Niu

TL;DR

Boosting of Thoughts (BoT) presents an automated, error-analysis-driven prompting framework that starts from a simple prompt and iteratively accumulates experience to refine thought generation. It uses a heterogeneous set of shallow weighted binary trees (per iteration) to explore reasoning steps, aggregates them into a single thought chain, and leverages LLM self-evaluation to produce revision guidance that is incorporated into the prompt over $T$ iterations with $M$ trees per iteration. By collecting and utilizing these reasoning experiences, BoT achieves competitive or state-of-the-art performance on several complex math benchmarks (e.g., GSM8K, AQuA) without manual annotations, and demonstrates strong gains on Game of 24, while showing some limits on harder datasets like MATH. The results indicate that error analysis and experience-driven prompt refinement can substantially enhance LLM reasoning, enabling scalable, task-agnostic improvement across diverse problems. BoT also reveals the critical role of model strength and the quality of feedback in determining gains from experience-driven prompting.

Abstract

The reasoning performance of Large Language Models (LLMs) on a wide range of problems critically relies on chain-of-thought prompting, which involves providing a few chain-of-thought demonstrations as exemplars in prompts. Recent work, e.g., Tree of Thoughts, has pointed out the importance of exploration and self-evaluation in reasoning step selection for complex problem solving. In this paper, we present Boosting of Thoughts (BoT), an automated prompting framework for problem solving with LLMs that iteratively explores and self-evaluates many trees of thoughts in order to acquire an ensemble of trial-and-error reasoning experiences, which then serve as a new form of prompting to solve the complex problem. Starting from a simple prompt without requiring examples, BoT iteratively explores and evaluates a large collection of reasoning steps and, more importantly, uses error analysis obtained from the LLM on them to explicitly revise prompting, which in turn enhances reasoning step generation, until a final answer is attained. Our experiments with GPT-4 and Llama2 across extensive complex mathematical problems demonstrate that BoT consistently achieves higher or comparable problem-solving rates than other advanced prompting approaches.
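
The iterative loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function and parameter names (`boosting_of_thoughts`, `propose`, `critique`) are hypothetical, each "tree" is reduced to a single scored candidate chain, aggregation is simplified to picking the highest-scoring chain, and the LLM calls are replaced by deterministic stubs so that the example runs without a model.

```python
from typing import Callable, List, Tuple

# A candidate thought chain: (reasoning steps, self-evaluation score).
Chain = Tuple[List[str], float]


def boosting_of_thoughts(
    question: str,
    propose: Callable[[str, int], Chain],       # hypothetical: explores one thought structure
    critique: Callable[[str, List[str]], str],  # hypothetical: LLM error analysis of a chain
    num_iterations: int,                        # T in the text
    num_trees: int,                             # M in the text
) -> Tuple[List[str], List[str]]:
    """Return the last aggregated chain and the accumulated experiences."""
    prompt = question  # start from a simple prompt, no exemplars
    experiences: List[str] = []
    best_steps: List[str] = []
    for _ in range(num_iterations):
        # Stage 1: explore M thought structures with the current prompt.
        candidates = [propose(prompt, m) for m in range(num_trees)]
        # Stage 2 (simplified): aggregate by keeping the best-scoring chain.
        best_steps, _ = max(candidates, key=lambda chain: chain[1])
        # Stage 3: obtain error analysis and add it to the prompt as experience.
        experiences.append(critique(question, best_steps))
        prompt = question + "\n\nPrevious attempts and advice:\n" + "\n".join(experiences)
    return best_steps, experiences


# Deterministic stand-ins for the LLM, for illustration only.
def stub_propose(prompt: str, tree_index: int) -> Chain:
    rounds = prompt.count("[experience]")  # more experience -> higher score
    return [f"step from tree {tree_index} after {rounds} rounds"], rounds + 0.1 * tree_index


def stub_critique(question: str, steps: List[str]) -> str:
    return f"[experience] revise: {steps[-1]}"


steps, experiences = boosting_of_thoughts("Q", stub_propose, stub_critique, 3, 2)
print(steps)
print(len(experiences))
```

In a real setting, `propose` and `critique` would wrap calls to an LLM, and the aggregation step would follow the paper's own strategy rather than a plain maximum.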

Paper Structure (23 sections, 5 figures, 18 tables, 2 algorithms)

Introduction
Related Work
Boosting of Thoughts
Background
Framework
Experiments
Main Results
Game of 24
Ablation Study
Concluding Remarks
Basic Prompts and Reasoning Pipeline of BoT
Thought generation part of BoT
Experience generation part of BoT
Reasoning Pipeline
Insights for Boosting of Thoughts
...and 8 more sections

Figures (5)

Figure 1: Boosting of thoughts iteratively enhances the prompt by adding experience, which comprises the analysis conducted by large language models (LLM or LM) on the generated thought chain. The experience specifically contains the thought chain itself, the corresponding error reports, and detailed advice on revising each reasoning step. Thus, those ineffective thoughts marked with a red cross can also contribute to prompt refinement. By accumulating experiences over iterations in the prompt, BoT can eventually yield a correct thought chain starting from a simple prompt. The examples presented here are extracted from results obtained by applying GPT-4 with BoT on the Game of 24 task.
Figure 2: Overview of the pipeline in each iteration of BoT. To show how boosting is achieved in this experience-driven iterative process, we present detailed intermediate results from an experiment with ChatGPT-4 on the Game of 24 dataset. Given $Q:$ "The given four numbers are: 2, 4, 5, 5", BoT performs three stages sequentially. With the simple prompt ${\mathbb{I}}^t$ as input, Thought Structures Generation (Stage 1) outputs a large number of heterogeneous tree thought structures. Thought Structures Aggregation (Stage 2) aggregates them into a thought chain $\overline{z}_{1...n}$, which is analyzed in Stage 3 to produce experience that further enhances the prompt.
Figure 3: Solve rates obtained by applying BoT and BoT+CoT to GPT-4 and Llama2.
Figure 4: Comparison of three approaches across varying numbers of trees and iterations.
Figure 5: Solve rates of different methods on all problems from each category of the MATH dataset. The comparison between these methods is performed on the test-set categories: PreAlgebra, Algebra, Counting & Probability, Number Theory, Geometry, Precalculus, and Intermediate Algebra. The sub-figure labeled 'Overall' shows the solve rate computed over all problems across all categories.
