Supervised Chain of Thought

Xiang Zhang, Dujian Ding

TL;DR

This work builds upon previous theoretical analyses of CoT to demonstrate how the one-prompt-for-all approach can negatively affect the computability of LLMs, and partitions the solution search space into two: the prompt space and the answer space.

Abstract

Large Language Models (LLMs) have revolutionized natural language processing and hold immense potential for advancing Artificial Intelligence. However, the core architecture of most mainstream LLMs -- the Transformer -- has inherent limitations in computational depth, rendering them theoretically incapable of solving many reasoning tasks that demand increasingly deep computations. Chain of Thought (CoT) prompting has emerged as a technique to address these architectural limitations, as evidenced by several theoretical studies. It offers a promising approach to solving complex reasoning tasks that were previously beyond the capabilities of these models. Despite its successes, CoT and its variants (such as Tree of Thought, Graph of Thought, etc.) rely on a "one-prompt-for-all" approach, using a single prompt structure (e.g., "think step by step") for a wide range of tasks -- from counting and sorting to solving mathematical and algorithmic problems. This approach poses significant challenges for models to generate the correct reasoning steps, as the model must navigate through a vast prompt template space to find the appropriate template for each task. In this work, we build upon previous theoretical analyses of CoT to demonstrate how the one-prompt-for-all approach can negatively affect the computability of LLMs. We partition the solution search space into two: the prompt space and the answer space. Our findings show that task-specific supervision is essential for navigating the prompt space accurately and achieving optimal performance. Through experiments with state-of-the-art LLMs, we reveal a gap in reasoning performance when supervision is applied versus when it is not.
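The role of a step template can be illustrated with a toy sketch. This is not the authors' code or experimental setup; `run_chain`, `counting_template`, and `lossy_template` are hypothetical names introduced only for illustration. The sketch assumes that only the text written at each step is carried forward to the next step, and it shows how a template that writes the wrong information loses the state needed for a counting task, while a task-specific template preserves it.

```python
from typing import Callable, List

# A "step template" decides what gets written out at each reasoning step.
# Assumption for this toy: only the written text is carried to the next step.
StepTemplate = Callable[[str, str], str]


def run_chain(items: List[str], template: StepTemplate, initial: str) -> List[str]:
    """Apply the template once per item, feeding each written step into the next."""
    steps = [initial]
    for item in items:
        steps.append(template(steps[-1], item))
    return steps


def counting_template(prev: str, item: str) -> str:
    """Task-specific template for counting 'a': write the running count each step."""
    return str(int(prev) + (1 if item == "a" else 0))


def lossy_template(prev: str, item: str) -> str:
    """A plausible but wrong template: write only whether the current item matches."""
    return "1" if item == "a" else "0"


good = run_chain(list("banana"), counting_template, "0")
bad = run_chain(list("banana"), lossy_template, "0")

print(good[-1])  # 3 (the running count survives every step)
print(bad[-1])   # 1 (earlier matches were never written down, so they are lost)
```

Both chains take the same number of steps; they differ only in what each step records. That is the sense in which choosing the template is a search problem separate from computing the answer.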

Paper Structure

This paper contains 19 sections, 1 equation, 10 figures, 3 tables.

Figures (10)

  • Figure 1: (a) Without supervision during CoT, the model generates its own step template for recurrent computation. This template can be incorrect, leading to task failure. (b) With human supervision, the task performance under CoT can be properly guided. (c) When CoT is not employed, the model relies solely on its internal reasoning via the Transformer architecture. (d) The Transformer can only perform constant-depth sequential computations. We assume that this Transformer neither memorizes the results nor performs bit-level (circuit) reasoning; instead, reasoning occurs at the neuron (hidden state) level.
  • Figure 2: Comparison between recurrence and autoregression.
  • Figure 3: A visualization of the CoT search space, which is decomposed into a prompt space and an answer space for a given problem.
  • Figure 4: ToT mechanism. $\mathbf{h}_{\texttt{t}}$ is transitioned into different $\mathbf{h}_{\texttt{t+1}}$ to explore more of the answer space. How the state is transitioned is dictated by the step template of CoT, which goes beyond what ToT offers.
  • Figure 5: Average success rate in deriving the correct step template at each task level.
  • ...and 5 more figures