Why Prompt Design Matters and Works: A Complexity Analysis of Prompt Search Space in LLMs

Xiang Zhang, Juntai Cao, Jiaqi Wei, Chenyu You, Dujian Ding

TL;DR

The paper presents a theoretical framework showing that prompts act as selectors that govern which task-relevant information is extracted from a model's hidden state during Chain-of-Thought reasoning. It decomposes CoT into prompt-space and answer-space components, analyzes their complexities, and demonstrates that optimally supervised prompts dramatically improve reasoning performance compared with naive unsupervised CoT. Through controlled experiments across basic to intermediate reasoning tasks, it provides empirical evidence that recurrence-like computation via well-designed prompts is crucial for tackling deeper tasks. The work offers a principled foundation for prompt engineering and practical guidance for designing task-specific prompts and evaluation strategies.

Abstract

Despite the remarkable successes of large language models (LLMs), the underlying Transformer architecture has inherent limitations in handling complex reasoning tasks. Chain-of-thought (CoT) prompting has emerged as a practical workaround, but most CoT-based methods rely on a single, generic prompt such as "think step by step", with no task-specific adaptation. These approaches expect the model to discover an effective reasoning path on its own, forcing it to search through a vast prompt space. In contrast, several studies have explored task-specific prompt designs to boost performance. However, these designs are typically developed through trial and error, lacking theoretical grounding. As a result, prompt engineering remains largely ad hoc and unguided. In this paper, we provide a theoretical framework that explains why some prompts succeed while others fail. We show that prompts function as selectors, extracting task-relevant information from the model's full hidden state during CoT reasoning. Each prompt defines a unique trajectory through the answer space, and the choice of trajectory is crucial for task performance and future navigation within the space. We analyze the complexity of finding optimal prompts and characterize the size of the prompt space for a given task. Our theory reveals principles behind effective prompt design and shows that naive CoT, using self-guided prompts like "think step by step", can severely hinder performance. Through experiments, we show that optimal prompt search can lead to more than a 50% improvement on reasoning tasks, providing a theoretical foundation for prompt engineering.

Paper Structure

This paper contains 31 sections, 2 equations, 17 figures, 11 tables.

Figures (17)

  • Figure 1: Prompt templates influence the configuration and navigation of the answer space. Prompt-space search techniques include methods like Auto-Prompt (Shin et al., 2020) and RL-Prompt (Deng et al., 2022), while answer-space search includes ToT (Yao et al., 2024) and ReST-MCTS$^*$ (Zhang et al., 2024).
  • Figure 2: (a) In naive CoT, the model generates its own step template for recurrent computation. This template can be incorrect or suboptimal, leading to task failure. (b) With prompt engineering and design, task performance under CoT can be properly guided. (c) When CoT is not employed, the model relies solely on its internal reasoning via the Transformer architecture. (d) A Transformer can perform only constant-depth sequential computation.
  • Figure 3: How CoT approximates recurrent computation.
  • Figure 4: Computational information flow in recurrent and autoregressive models.
  • Figure 5: CoT extracts only partial information from $\mathbf{h}$. Which information is extracted is dictated by the prompt template.
  • ...and 12 more figures
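
The selector view in the abstract and in the captions of Figures 3 and 5 can be illustrated with a small toy, sketched below. It is not the paper's experimental setup: the parity task, the feature names (`parity`, `ones_seen`, `last_bit`), and the functions `hidden_state` and `chain_of_thought` are all hypothetical choices made for illustration. The one assumption it encodes is that an autoregressive model carries nothing between steps except the text it has emitted, so the template decides which part of the hidden state survives.

```python
from typing import Dict, Sequence, Tuple

Features = Dict[str, int]


def hidden_state(emitted_text: Features, bit: int) -> Features:
    """Toy stand-in for h: every feature the model can compute at one step.

    The model has no memory besides the text it has emitted, so any
    feature missing from that text restarts from its default value.
    """
    return {
        "parity": emitted_text.get("parity", 0) ^ bit,
        "ones_seen": emitted_text.get("ones_seen", 0) + bit,
        "last_bit": bit,
    }


def chain_of_thought(bits: Sequence[int], template: Tuple[str, ...]) -> int:
    """Run one CoT step per input bit and return the parity the model reports."""
    emitted_text: Features = {}
    reported_parity = 0  # answer for an empty input
    for bit in bits:
        h = hidden_state(emitted_text, bit)
        # The template acts as the selector: only these features of h are
        # written out, and only written features reach the next step.
        emitted_text = {name: h[name] for name in template}
        reported_parity = h["parity"]
    return reported_parity


bits = [1, 1, 0, 1, 0]
print(chain_of_thought(bits, ("parity",)))    # prints 1: running parity is carried
print(chain_of_thought(bits, ("last_bit",)))  # prints 0: running parity is dropped
```

In this toy, both runs compute the same hidden state at every step. The only difference is which feature the template writes out, and that alone determines whether the recurrence over the input can be completed.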