Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models
Sijia Chen, Baochun Li, Di Niu
TL;DR
Boosting of Thoughts (BoT) is an automated, error-analysis-driven prompting framework that starts from a simple prompt and iteratively accumulates experience to refine thought generation. In each of $T$ iterations, it builds $M$ heterogeneous, shallow, weighted binary trees to explore reasoning steps, aggregates them into a single thought chain, and uses LLM self-evaluation of that chain to produce revision guidance that is added to the prompt for the next iteration. By collecting and reusing these reasoning experiences, BoT achieves competitive or state-of-the-art performance on several complex math benchmarks (e.g., GSM8K, AQuA) without manual annotations and shows strong gains on Game of 24, while showing some limits on harder datasets such as MATH. The results indicate that error analysis and experience-driven prompt refinement can substantially enhance LLM reasoning in a scalable, task-agnostic way. They also show that the gains from experience-driven prompting depend critically on the strength of the model and the quality of its feedback.
Abstract
The reasoning performance of Large Language Models (LLMs) on a wide range of problems critically relies on chain-of-thought prompting, which involves providing a few chain-of-thought demonstrations as exemplars in prompts. Recent work, e.g., Tree of Thoughts, has pointed out the importance of exploration and self-evaluation in reasoning step selection for complex problem solving. In this paper, we present Boosting of Thoughts (BoT), an automated prompting framework for problem solving with LLMs that iteratively explores and self-evaluates many trees of thoughts in order to acquire an ensemble of trial-and-error reasoning experiences, which serve as a new form of prompting to solve the complex problem. Starting from a simple prompt without requiring examples, BoT iteratively explores and evaluates a large collection of reasoning steps and, more importantly, uses the LLM's error analysis of those steps to explicitly revise the prompt, which in turn enhances reasoning step generation, until a final answer is attained. Our experiments with GPT-4 and Llama2 on a wide range of complex mathematical problems demonstrate that BoT consistently achieves higher or comparable problem-solving rates than other advanced prompting approaches.
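The loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `propose`, `score`, and `critique` are hypothetical stand-ins for LLM calls, each tree is reduced to a greedy walk that keeps the better of two candidate thoughts per level, and aggregation is simplified to picking the highest-weight chain.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for LLM calls (assumptions, not an API from the paper):
Propose = Callable[[str, List[str], int, int], str]  # (prompt, partial chain, tree id, branch) -> thought
Score = Callable[[List[str], str], float]            # (partial chain, thought) -> weight
Critique = Callable[[List[str]], str]                # chain -> error analysis / revision advice


def grow_chain(prompt: str, propose: Propose, score: Score,
               tree_id: int, depth: int) -> Tuple[List[str], float]:
    """Walk one shallow binary tree greedily, keeping the better child per level."""
    chain: List[str] = []
    total = 0.0
    for _ in range(depth):
        candidates = [propose(prompt, chain, tree_id, b) for b in range(2)]
        weights = [score(chain, c) for c in candidates]
        best = max(range(2), key=lambda i: weights[i])
        chain = chain + [candidates[best]]
        total += weights[best]
    return chain, total


def boosting_of_thoughts(task: str, propose: Propose, score: Score,
                         critique: Critique, T: int, M: int,
                         depth: int = 3) -> Tuple[List[str], List[str]]:
    """Run T iterations of M trees each, feeding error analysis back into the prompt."""
    experience: List[str] = []
    best_chain: List[str] = []
    for _ in range(T):
        # The prompt starts as the bare task and grows with accumulated experience.
        prompt = "\n".join([task] + experience)
        chains = [grow_chain(prompt, propose, score, m, depth) for m in range(M)]
        # Simplified aggregation: keep the chain with the highest total weight.
        best_chain, _ = max(chains, key=lambda c: c[1])
        # Error analysis of the aggregated chain becomes new experience.
        experience.append(critique(best_chain))
    return best_chain, experience
```

Passing the tree index to `propose` is one way a caller could vary the generation settings per tree, which is how this sketch leaves room for heterogeneous trees.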
