Prompt Optimization as a State-Space Search Problem

Maanas Taneja

TL;DR

The paper reframes prompt optimization as a state-space search where prompts are nodes and transformations are edges, and evaluates classical search methods (beam search and random walk) on a set of prompt mutations. By using GPT-4o to generate prompts and GPT-5 as a critic for complex tasks, the approach demonstrates dev-set performance gains across five NLP tasks, while revealing a consistent gap between development and test generalization due to overfitting and evaluator noise. Key findings show that concise prompts and adding example demonstrations frequently drive improvements, while verbose prompts are rarely beneficial, and beam search generally outperforms other strategies under tight compute budgets. The work highlights the viability of search-based prompt optimization, calls for deeper and broader exploration with improved evaluation metrics, and discusses practical considerations such as data realism, operator diversity, and cost, outlining a path toward more robust, generalizable prompt design.

Abstract

Language models are extremely susceptible to performance collapse with even small changes to input prompt strings. Libraries such as DSPy (from Stanford NLP) avoid this problem through demonstration-based prompt optimisation. Inspired by this, I propose an alternative approach that treats prompt optimisation as a classical state-space search problem. I model the prompt space as a graph where nodes represent prompt states and edges correspond to deliberate transformations such as shortening, adding examples, or reordering content. Using beam search and random walk algorithms, I systematically explore this space, evaluating candidates on development sets and pruning unpromising branches. Across five NLP tasks (sentiment classification, question answering, summarisation, reasoning, and natural language inference), I find that even shallow search configurations (beam width=2, depth=2) improve upon seed prompts on development sets. For instance, beam search raises development accuracy from 0.40 to 0.80 on reasoning tasks, though test set improvements are more modest (0.20 to 0.50), indicating overfitting to the development heuristic. Analysis of successful optimisation paths reveals that transformations that make prompts concise appear most frequently, while verbosity operators are never selected. My results validate prompt optimization as a search problem and suggest that with greater computational resources and improved evaluation metrics, deeper exploration could yield more robust prompts that generalize beyond development sets. Code and implementation are available at [https://github.com/MaanasTaneja/PromptOptimiser].
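To make the search framing concrete, here is a minimal Python sketch of beam search and a random walk over a prompt graph. It is an illustration under stated assumptions, not the implementation from the linked repository: the operators (`shorten`, `add_example`, `make_verbose`) are hypothetical string edits standing in for LLM-generated transformations, and `toy_score` is a made-up heuristic standing in for development-set accuracy.

```python
import random
from typing import Callable, List, Optional, Tuple

Prompt = str
Operator = Callable[[Prompt], Prompt]   # an edge: transforms one prompt into another
Scorer = Callable[[Prompt], float]      # heuristic: higher is better
Scored = Tuple[float, Prompt]


def beam_search(seed: Prompt, operators: List[Operator], score: Scorer,
                beam_width: int = 2, depth: int = 2) -> Scored:
    """Expand every prompt in the beam with every operator, keep the top few."""
    beam: List[Scored] = [(score(seed), seed)]
    best = beam[0]
    for _ in range(depth):
        candidates = [(score(child), child)
                      for _, prompt in beam
                      for child in (op(prompt) for op in operators)]
        if not candidates:
            break
        # Prune: only the beam_width highest-scoring children survive.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
        if beam[0][0] > best[0]:
            best = beam[0]
    return best


def random_walk(seed: Prompt, operators: List[Operator], score: Scorer,
                steps: int = 4, rng: Optional[random.Random] = None) -> Scored:
    """Apply randomly chosen operators in sequence, remembering the best state seen."""
    rng = rng or random.Random(0)
    current = seed
    best = (score(seed), seed)
    for _ in range(steps):
        current = rng.choice(operators)(current)
        current_score = score(current)
        if current_score > best[0]:
            best = (current_score, current)
    return best


# --- Hypothetical stand-ins (assumptions, not the paper's operators) ---
FILLER = {"please", "kindly", "very"}

def shorten(p: Prompt) -> Prompt:
    return " ".join(w for w in p.split() if w.lower() not in FILLER)

def add_example(p: Prompt) -> Prompt:
    return p if "Example:" in p else p + " Example: 2+2=4."

def make_verbose(p: Prompt) -> Prompt:
    return p + " Think very carefully and kindly explain everything."

def toy_score(p: Prompt) -> float:
    # Made-up heuristic: reward a demonstration, penalise length.
    bonus = 1.0 if "Example:" in p else 0.0
    return bonus - 0.01 * len(p.split())


OPERATORS = [shorten, add_example, make_verbose]
SEED = "Please kindly solve the very hard problem."

best_score, best_prompt = beam_search(SEED, OPERATORS, toy_score)
walk_score, walk_prompt = random_walk(SEED, OPERATORS, toy_score)
print(best_prompt, round(best_score, 2))
```

In a real setting, `score` would run the candidate prompt on a development set, which is where the cost and the overfitting risk described above come from: the search only ever sees the development heuristic, so the selected prompt should be checked on a held-out test set.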

Paper Structure

This paper contains 59 sections, 9 equations, 5 tables, 5 algorithms.