Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs

Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, Omar Khattab

TL;DR

The paper addresses prompt optimization for multi-stage Language Model Programs (LMPs) under weak supervision (no module-level labels or gradients). It formalizes the optimization problem, identifies proposal and credit assignment as the core challenges, and lays out a design space with grounded proposal strategies and three credit-assignment methods. The authors introduce MIPRO, a Bayesian surrogate-based optimizer that jointly optimizes per-module instructions and demonstrations by separating proposal from credit assignment. Using Llama-3-8B, MIPRO outperforms baseline optimizers on five of seven tasks, by as much as 13% accuracy. A DSPy-based benchmark with seven diverse LM-program tasks is released to support future research on LM program optimization.

Abstract

Language Model Programs, i.e. sophisticated pipelines of modular language model (LM) calls, are increasingly advancing NLP tasks, but they require crafting prompts that are jointly effective for all modules. We study prompt optimization for LM programs, i.e. how to update these prompts to maximize a downstream metric without access to module-level labels or gradients. To make this tractable, we factorize our problem into optimizing the free-form instructions and few-shot demonstrations of every module and introduce several strategies to craft task-grounded instructions and navigate credit assignment across modules. Our strategies include (i) program- and data-aware techniques for proposing effective instructions, (ii) a stochastic mini-batch evaluation function for learning a surrogate model of our objective, and (iii) a meta-optimization procedure in which we refine how LMs construct proposals over time. Using these insights we develop MIPRO, a novel algorithm for optimizing LM programs. MIPRO outperforms baseline optimizers on five of seven diverse multi-stage LM programs using a best-in-class open-source model (Llama-3-8B), by as high as 13% accuracy. We have released our new optimizers and benchmark in DSPy at http://dspy.ai

Paper Structure (58 sections, 2 equations, 13 figures, 10 tables, 1 algorithm)

This paper contains 58 sections, 2 equations, 13 figures, 10 tables, 1 algorithm.

Figures (13)

  • Figure 1: An example of the optimization problem we explore, shown for a multi-hop retrieval LM program. Given some question--answer pairs and a metric, the optimizer proposes new instructions and bootstraps new demonstrations (not pictured) for each stage.
  • Figure 2: Bootstrap Random Search. In Step 1, demonstrations are bootstrapped by running training inputs through the program $\Phi$ and keeping traces that produce sufficiently high scoring outputs, as judged by metric $\mu$. In Step 2, these bootstrapped demonstration sets are searched over using random search, and the most performant set is returned.
  • Figure 3: The Module-Level OPRO optimizer. A history of module-level instructions and program score pairs are given as input to the proposer LM to generate a new instruction for each module. These are then evaluated in the program, and the resulting score is added back with each module's instruction to the module's history. The process repeats for $I$ iterations.
  • Figure 4: The MIPRO optimizer. In Step 1, demonstrations are bootstrapped using the same process from Step 1 of Bootstrap Random Search. In Step 2, instructions are proposed using the grounding strategy described in the paper's prompt proposal section. In Step 3, Bayesian optimization is used to find the best performing combination of instruction and demonstration candidates.
  • Figure 5: Learned hyperparameter importances for ScoNe. Here we see that the Bayesian model learned the dataset summary, the tip, and the task demos in the prompt to be important to proposal quality.
  • ...and 8 more figures
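
The two steps in the Figure 2 caption (bootstrap demonstrations, then random search over demonstration sets) can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the authors' or DSPy's implementation: the program is a single hypothetical function `toy_program` standing in for an LM program, the metric is exact match, and all names (`bootstrap_demos`, `bootstrap_random_search`, `average_score`, `threshold`, `num_candidates`) are invented for illustration. A real multi-stage program would record a trace per module rather than one input/output pair.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (input, gold output)
Demo = Tuple[str, str]     # (input, program output) from a successful trace
Program = Callable[[str, Sequence[Demo]], str]
Metric = Callable[[str, str], float]


def average_score(program: Program, demos: Sequence[Demo],
                  data: Sequence[Example], metric: Metric) -> float:
    """Mean metric value of the program on `data` when prompted with `demos`."""
    return sum(metric(program(x, demos), gold) for x, gold in data) / len(data)


def bootstrap_demos(program: Program, trainset: Sequence[Example],
                    metric: Metric, threshold: float) -> List[Demo]:
    """Step 1: run training inputs through the program with no demos and
    keep the traces whose output scores at least `threshold`."""
    pool = []
    for x, gold in trainset:
        output = program(x, [])
        if metric(output, gold) >= threshold:
            pool.append((x, output))
    return pool


def bootstrap_random_search(program: Program, trainset: Sequence[Example],
                            valset: Sequence[Example], metric: Metric,
                            num_candidates: int = 8, demos_per_set: int = 2,
                            threshold: float = 1.0, seed: int = 0):
    """Step 2: sample candidate demo sets from the bootstrapped pool and
    return the set with the best validation score."""
    rng = random.Random(seed)
    pool = bootstrap_demos(program, trainset, metric, threshold)
    best_demos: List[Demo] = []  # the zero-shot program is the starting point
    best_score = average_score(program, best_demos, valset, metric)
    for _ in range(num_candidates):
        candidate = rng.sample(pool, min(demos_per_set, len(pool)))
        score = average_score(program, candidate, valset, metric)
        if score > best_score:
            best_demos, best_score = candidate, score
    return best_demos, best_score


def toy_program(x: str, demos: Sequence[Demo]) -> str:
    # Hypothetical stand-in for an LM call: without demonstrations it only
    # handles short inputs; with any demonstration it handles every input.
    if demos or len(x) <= 3:
        return x.upper()
    return x


def exact_match(pred: str, gold: str) -> float:
    return float(pred == gold)


trainset = [("cat", "CAT"), ("horse", "HORSE"), ("ox", "OX"), ("zebra", "ZEBRA")]
valset = [("dog", "DOG"), ("tiger", "TIGER"), ("llama", "LLAMA"), ("eel", "EEL")]
best_demos, best_score = bootstrap_random_search(
    toy_program, trainset, valset, exact_match)
print(best_demos, best_score)
```

In this toy run only the short training inputs yield traces that pass the metric, so the pool holds two demonstrations, and any sampled set lifts the validation score above the zero-shot baseline. Scoring whole candidate sets by the end-to-end metric is what lets the search proceed without module-level labels.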