Structured Prompting Enables More Robust Evaluation of Language Models

Asad Aali, Muhammad Ahmed Mohsin, Vasiliki Bikia, Arnav Singhvi, Richard Gaus, Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Yifan Mai, Jordan Cahoon, Michael Pfeffer, Roxana Daneshjou, Sanmi Koyejo, Emily Alsentzer, Christopher Potts, Nigam H. Shah, Akshay S. Chaudhari

TL;DR

The paper tackles the problem of underestimating language-model capabilities when using fixed prompts by proposing a structured prompting framework that estimates performance ceilings. By integrating DSPy with HELM (DSPy+HELM), it systematically evaluates four prompting methods across seven HELM benchmarks using four frontier LMs, comparing against HELM baselines. Key findings show structured prompting yields about a 4% absolute accuracy gain on average, reduces cross-benchmark variance, and can flip leaderboard standings on several tasks; chain-of-thought prompting, in particular, reduces sensitivity to prompt design and is the most cost-effective. The work demonstrates that scalable, automated performance-ceiling approximation leads to more robust, decision-useful benchmarks and provides open-source pipelines for replication and extension.

Abstract

As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks that accurately estimate performance are essential for guiding deployment decisions. While frameworks such as Holistic Evaluation of Language Models (HELM) enable broad evaluation across tasks, they often rely on fixed prompts that fail to generalize across LMs, yielding unrepresentative performance estimates. Unless we approximate each LM's ceiling (maximum achievable via changes to the prompt), we risk underestimating performance. Declarative prompting frameworks, such as DSPy, offer a scalable alternative to manual prompt engineering by crafting structured prompts that can be optimized per task. However, such frameworks have not been systematically evaluated across established benchmarks. We present a reproducible DSPy+HELM framework that introduces structured prompting methods which elicit reasoning, enabling more accurate LM benchmarking. Using four prompting methods, we evaluate four frontier LMs across seven benchmarks (general/medical domain) against existing HELM baseline scores. We find that without structured prompting: (i) HELM underestimates LM performance (by 4% average), (ii) performance estimates vary more across benchmarks (+2% standard deviation), (iii) performance gaps are misrepresented (leaderboard rankings flip on 3/7 benchmarks), and (iv) introducing chain-of-thought reduces LM sensitivity to prompt design (smaller $\Delta$ across prompts). To our knowledge, this is the first benchmarking study to systematically integrate structured prompting into an established evaluation framework, demonstrating how scalable performance-ceiling approximation yields more robust, decision-useful benchmarks. We open-source (i) DSPy+HELM Integration (https://github.com/stanford-crfm/helm/pull/3893) and (ii) Prompt Optimization Pipeline (https://github.com/StanfordMIMI/dspy-helm).
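The ceiling idea in the abstract can be illustrated with a minimal sketch. It assumes the ceiling is approximated as the best score over the evaluated prompt variants, and it uses made-up accuracies; the model names, benchmark name, and helper functions (`ceiling`, `ceiling_gain`, `rank`) are hypothetical and are not taken from the paper or from the DSPy or HELM APIs.

```python
# Hypothetical accuracies (illustrative only, not results from the paper).
# Layout: scores[model][benchmark][prompting method]
scores = {
    "model_a": {
        "bench_1": {"baseline": 0.70, "zero_shot_cot": 0.78, "bfrs": 0.77, "miprov2": 0.79},
    },
    "model_b": {
        "bench_1": {"baseline": 0.74, "zero_shot_cot": 0.75, "bfrs": 0.76, "miprov2": 0.75},
    },
}


def ceiling(per_method):
    """Approximate the ceiling as the best score over all prompt variants."""
    return max(per_method.values())


def ceiling_gain(per_method):
    """How much the fixed baseline prompt underestimates the approximated ceiling."""
    return ceiling(per_method) - per_method["baseline"]


def rank(all_scores, benchmark, score_fn):
    """Model names sorted best-first on one benchmark under a scoring rule."""
    return sorted(
        all_scores,
        key=lambda model: score_fn(all_scores[model][benchmark]),
        reverse=True,
    )


# Compare the leaderboard under the fixed baseline prompt and under the ceiling.
baseline_rank = rank(scores, "bench_1", lambda s: s["baseline"])
ceiling_rank = rank(scores, "bench_1", ceiling)
ranking_flipped = baseline_rank != ceiling_rank

print("baseline ranking:", baseline_rank)
print("ceiling ranking: ", ceiling_rank)
print("ranking flipped: ", ranking_flipped)
```

With these invented numbers, the model that trails under the baseline prompt leads once each model is scored at its approximated ceiling, which is the kind of leaderboard flip the abstract describes.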

Paper Structure

This paper contains 21 sections, 17 equations, 4 figures, 5 tables, 2 algorithms.

Figures (4)

  • Figure 1: Pipeline overview. (a) DSPy takes HELM's baseline prompt and produces structured prompt variants. (b) HELM evaluates models under each prompt variant. With structured prompting, we observe more robust evaluation: (i) improved performance, (ii) reduced variance, (iii) altered gaps (flipped rankings).
  • Figure 2: Structured prompting methods evaluated in our study (Zero-Shot CoT, BFRS, MIPROv2). Each box corresponds to one method, showing how instructions and context differ across methods. For BFRS and MIPROv2, $K$ denotes the number of in-context demonstrations (Inputs $\rightarrow$ Reasoning, Output).
  • Figure 3: Heat map showing $\Delta$ (increase in accuracy) of each prompting method over HELM's baseline (light=small, dark=large). Across four models, x-axis lists prompting methods, y-axis lists benchmarks. All structured prompting methods exhibit similar improvements, while o3 Mini remains relatively insensitive.
  • Figure 4: Accuracy vs cost tradeoff across prompting methods. Each point represents a model-prompt pair, with x-axis showing additional prompt tokens (relative to HELM baseline) and y-axis showing macro-averaged accuracy across benchmarks. Overall, Zero-Shot CoT is the most cost-effective structured prompting method.
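
The quantities behind Figures 3 and 4 can be sketched as follows: a per-benchmark $\Delta$ over the baseline, and a macro-averaged accuracy set against additional prompt tokens. All numbers, benchmark names, and token counts below are invented for illustration, and "gain per 1,000 additional tokens" is only one assumed way to summarize cost-effectiveness; the paper may define it differently.

```python
from statistics import mean

# Hypothetical data for a single model (illustrative only, not from the paper).
accuracy = {
    "baseline":      {"bench_1": 0.70, "bench_2": 0.60, "bench_3": 0.80},
    "zero_shot_cot": {"bench_1": 0.76, "bench_2": 0.66, "bench_3": 0.83},
    "miprov2":       {"bench_1": 0.77, "bench_2": 0.66, "bench_3": 0.84},
}
# Assumed additional prompt tokens relative to the baseline prompt.
extra_tokens = {"baseline": 0, "zero_shot_cot": 150, "miprov2": 2000}


def delta_over_baseline(acc):
    """Per-method, per-benchmark accuracy increase over the baseline (heat-map cells)."""
    base = acc["baseline"]
    return {
        method: {bench: scores[bench] - base[bench] for bench in scores}
        for method, scores in acc.items()
        if method != "baseline"
    }


def macro_average(per_benchmark):
    """Unweighted mean accuracy across benchmarks."""
    return mean(per_benchmark.values())


def gain_per_1k_tokens(method):
    """Macro-averaged gain over baseline per 1,000 additional prompt tokens (assumed metric)."""
    gain = macro_average(accuracy[method]) - macro_average(accuracy["baseline"])
    return gain / (extra_tokens[method] / 1000)


deltas = delta_over_baseline(accuracy)
for method in deltas:
    print(method, round(macro_average(accuracy[method]), 3), round(gain_per_1k_tokens(method), 3))
```

In this toy setup the few-shot optimized prompt reaches a slightly higher macro-average, but the zero-shot chain-of-thought prompt delivers far more gain per added token, mirroring the tradeoff that Figure 4 reports.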