On the Empirical Complexity of Reasoning and Planning in LLMs

Liwei Kang, Zirui Zhao, David Hsu, Wee Sun Lee

TL;DR

The paper links the empirical success of chain-of-thought (CoT) and tree-of-thought (ToT) prompting in LLMs to sample and computational complexity principles, using six case studies (GSM8K, MusiQue, MWIS, air travel planning, Game of 24, and Blocksworld) to derive general guidelines. It formalizes reasoning as a planning problem and gives description length (DL) and minimum description length (MDL) analyses for Direct, CoT, ToT, and CoT-SC prompting, showing that problem decomposition reduces sample requirements, while tree-based ToT helps mainly when the task requires substantial search. Across tasks, explicitly annotating relevant variables and decomposing the problem improve performance, and ToT-Decomp often outperforms plain ToT or CoT on harder problems. The findings indicate when to deploy CoT versus ToT and how to structure prompts or fine-tuning to match the underlying computational structure of the task. Limitations include the use of tabular DL abstractions, the role of pretraining, and potential domain shifts; an Ethics statement addresses responsible use and potential misuse of reasoning capabilities.

Abstract

Chain-of-thought (CoT), tree-of-thought (ToT), and related techniques work surprisingly well in practice for some complex reasoning tasks with Large Language Models (LLMs), but why? This work seeks the underlying reasons by conducting experimental case studies and linking the performance benefits to well-established sample and computational complexity principles in machine learning. We experimented with 6 reasoning tasks, ranging from grade school math, air travel planning, ..., to Blocksworld. The results suggest that (i) both CoT and ToT benefit significantly from task decomposition, which breaks a complex reasoning task into a sequence of steps with low sample complexity and explicitly outlines the reasoning structure, and (ii) for computationally hard reasoning tasks, the more sophisticated tree structure of ToT outperforms the linear structure of CoT. These findings provide useful guidelines for the use of LLMs in solving reasoning tasks in practice.
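
To make the contrast between a linear chain of steps and a tree of steps concrete, the sketch below runs a plain depth-first search on the Game of 24, one of the case studies. It is only an illustration of why such a task needs search and backtracking: it is not the paper's ToT procedure, no LLM is involved, and the names `solve_24` and `_search` are hypothetical helpers introduced here.

```python
from fractions import Fraction
from itertools import combinations


def solve_24(cards, target=24):
    """Return an expression string that reaches `target`, or None if none exists."""
    start = [(Fraction(c), str(c)) for c in cards]
    return _search(start, Fraction(target))


def _search(items, target):
    # Each state is a list of (exact value, expression) pairs still to be combined.
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    # Branch on every pair of remaining numbers and every operation on that pair.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found  # first successful branch wins
    return None  # every branch failed, so the caller backtracks
```

A single left-to-right chain corresponds to following one branch of this tree without ever backtracking, which fails whenever an early choice turns out to be a dead end. Exact `Fraction` arithmetic is used so that solutions passing through non-integer intermediate values are not lost to rounding.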

Paper Structure

This paper contains 79 sections, 1 theorem, 11 figures, 15 tables, 2 algorithms.

Key Result

Theorem 3.1

Let $\mathcal{H}$ be a hypothesis class and let $d\colon\mathcal{H} \rightarrow \{0, 1\}^*$ be a prefix-free description language for $\mathcal{H}$. Then, for every sample size, $m$, every confidence parameter, $\delta > 0$, and every probability distribution, $D$, with probability greater than $1-\delta$ ...
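
The wording of Theorem 3.1 matches the standard textbook form of the Occam bound, in which, with probability greater than $1-\delta$ over a sample $S$ of size $m$, every $h \in \mathcal{H}$ satisfies $L_D(h) \le L_S(h) + \sqrt{(|h| + \ln(2/\delta))/(2m)}$, where $|h|$ is the length of $d(h)$. Assuming the paper uses that form, the sketch below evaluates the gap term and the sample size needed to push it below a tolerance, which illustrates why a shorter description implies a smaller sample requirement. The function names and the numbers in the usage note are hypothetical and are not taken from the paper.

```python
import math


def occam_gap(description_bits, m, delta):
    """Gap term sqrt((|h| + ln(2/delta)) / (2m)) of the textbook Occam bound."""
    if m <= 0 or not (0 < delta < 1):
        raise ValueError("need m > 0 and 0 < delta < 1")
    return math.sqrt((description_bits + math.log(2.0 / delta)) / (2.0 * m))


def samples_for_gap(description_bits, epsilon, delta):
    """Smallest sample size m for which the gap term is at most epsilon."""
    if epsilon <= 0 or not (0 < delta < 1):
        raise ValueError("need epsilon > 0 and 0 < delta < 1")
    return math.ceil((description_bits + math.log(2.0 / delta)) / (2.0 * epsilon ** 2))
```

For example, comparing `samples_for_gap(1000, 0.1, 0.05)` with `samples_for_gap(100, 0.1, 0.05)` shows the required sample size growing roughly linearly in the description length, which is the sense in which a decomposition with shorter per-step descriptions needs fewer samples.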

Figures (11)

  • Figure 1: An illustration of LLM reasoning methods on the Game of 24. Given four poker cards, the player combines the cards using basic arithmetic operations, $(+, -, \times, \div)$, to reach the target number of 24.
  • Figure 2: (a) Results of GPT-3.5 and GPT-4 on GSM8K Test set; (b) Fine-tuning results on Llama2-7b
  • Figure 3: F1 score of GPT-3.5 and GPT-4 on MusiQue Dev set. (a) uses natural-language context; (b) uses LLM-parsed relation triplets as context.
  • Figure 4: In-context learning results on MWIS. 3-shot prompts have one example each for sizes 4, 5, and 6, while 6-shot prompts have two examples each. "In-domain" refers to sizes 4, 5, and 6, and "Out-of-Domain" refers to sizes from 6 to 10.
  • Figure 5: Results of fine-tuning Llama2-7B-chat on MWIS.
  • ...and 6 more figures
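
For readers unfamiliar with MWIS (maximum weighted independent set), the sketch below solves it by dynamic programming under the assumption that the instance is a sequence of numbers from which no two adjacent entries may both be picked, with "size" meaning the sequence length. This page does not state the exact task format, so that reading is an assumption, and `mwis_path` is a hypothetical helper rather than code from the paper.

```python
def mwis_path(weights):
    """Return (best_sum, chosen_indices) with no two chosen indices adjacent."""
    n = len(weights)
    # best[i] is the optimum over the suffix weights[i:]; two padding zeros
    # avoid bounds checks when looking two positions ahead.
    best = [0] * (n + 2)
    for i in range(n - 1, -1, -1):
        skip = best[i + 1]               # leave element i out
        take = weights[i] + best[i + 2]  # pick element i, so i + 1 is excluded
        best[i] = max(skip, take)
    # Walk forward through the table to recover one optimal selection.
    chosen, i = [], 0
    while i < n:
        if weights[i] + best[i + 2] > best[i + 1]:
            chosen.append(i)
            i += 2
        else:
            i += 1
    return best[0], chosen
```

Each table entry depends only on the two entries after it, so the problem decomposes into a sequence of small, identical steps that can be written out one at a time.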

Theorems & Definitions (1)

  • Theorem 3.1: Occam's Razor