PromptPex: Automatic Test Generation for Language Model Prompts

Reshabh K Sharma, Jonathan De Halleux, Shraddha Barke, Benjamin Zorn

TL;DR

PromptPex is an automated framework that extracts input specifications and output rules from a prompt, then generates tests and evaluates them across multiple models to detect noncompliant outputs. The approach treats prompts as software-like artifacts with preconditions and postconditions, which enables systematic testing and shows how a prompt behaves when migrated to a different model. Experiments on eight benchmark prompts across multiple models show that specification-driven tests expose more prompt weaknesses than a baseline test generator and help differentiate model capabilities. The extracted specifications and tests give prompt engineers concrete artifacts for reasoning about prompt behavior across models and over time, supporting robustness and migration of prompts to newer models.

Abstract

Large language models (LLMs) are being used in many applications, and prompts for these models are integrated into software applications as code-like artifacts. These prompts behave much like traditional software in that they take inputs, generate outputs, and perform some specific function. However, prompts differ from traditional code in many ways and require new approaches to ensure that they are robust. For example, unlike traditional software, the output of a prompt depends on the AI model that interprets it. Also, while natural language prompts are easy to modify, the impact of updates is harder to predict. New approaches to testing, debugging, and modifying prompts with respect to the model running them are required. To address some of these issues, we developed PromptPex, an LLM-based tool to automatically generate and evaluate unit tests for a given prompt. PromptPex extracts input and output specifications from a prompt and uses them to generate diverse, targeted, and valid unit tests. These tests are instrumental in identifying regressions when a prompt is changed and also serve as a tool to understand how prompts are interpreted by different models. We use PromptPex to generate tests for eight benchmark prompts and evaluate the quality of the generated tests by seeing if they can cause each of four diverse models to produce invalid output. PromptPex consistently creates tests that result in more invalid model outputs than a carefully constructed baseline LLM-based test generator. Furthermore, by extracting concrete specifications from the input prompt, PromptPex allows prompt writers to clearly understand and test specific aspects of their prompts. The source code of PromptPex is available at https://github.com/microsoft/promptpex.
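The workflow described in the abstract (extract a specification, generate tests, run them on several models, count invalid outputs) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not the PromptPex API: the names `PromptSpec`, `noncompliance_counts`, and `single_token` are hypothetical, the two "models" are stub functions rather than real LLM calls, and the output rule is a plain Python predicate standing in for whatever compliance check the tool actually performs. The prompt text borrows the part-of-speech instruction quoted in the Figure 1 caption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration; not taken from the PromptPex source.
Model = Callable[[str, str], str]  # (prompt, test_input) -> model output
Rule = Callable[[str], bool]       # model output -> True if compliant


@dataclass
class PromptSpec:
    """Input specification and output rules extracted from a prompt."""
    input_spec: str
    output_rules: Dict[str, Rule]


def noncompliance_counts(
    prompt: str,
    spec: PromptSpec,
    tests: List[str],
    models: Dict[str, Model],
) -> Dict[str, int]:
    """Run every test on every model and count outputs that break a rule."""
    counts: Dict[str, int] = {}
    for name, model in models.items():
        failures = 0
        for test_input in tests:
            output = model(prompt, test_input)
            if not all(rule(output) for rule in spec.output_rules.values()):
                failures += 1
        counts[name] = failures
    return counts


def single_token(output: str) -> bool:
    """Simplified stand-in for the rule 'Return only the part of speech tag'."""
    return len(output.split()) == 1


# Stub models (assumptions): one follows the rule, one adds explanation.
def terse_model(prompt: str, test_input: str) -> str:
    return "NN"


def chatty_model(prompt: str, test_input: str) -> str:
    return "The word is a noun (NN)."


PROMPT = "Return only the part of speech tag."
SPEC = PromptSpec(
    input_spec="A single English word.",  # assumed wording
    output_rules={"only_tag": single_token},
)
TESTS = ["run", "quickly"]  # hand-written stand-ins for generated tests
COUNTS = noncompliance_counts(
    PROMPT, SPEC, TESTS, {"terse": terse_model, "chatty": chatty_model}
)
print(COUNTS)
```

In this sketch a higher count for a model means more of its outputs violated the extracted rules, which mirrors the kind of per-model comparison the abstract describes. Keeping each rule as a named entry makes it possible to report which specific rule a given output broke.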

Paper Structure

This paper contains 67 sections, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Example illustrating the use of PromptPex to test a given prompt against different models. For a given prompt (labeled System Prompt) and PromptPex-generated test input (on the left), the resulting output differs significantly depending on which AI model is used to interpret it. PromptPex automatically generates test input based on the prompt and evaluates whether the output is compliant with what the prompt specifies. Because the prompt specifies "Return only the part of speech tag", the lower two models produced non-compliant output.
  • Figure 2: Part-of-Speech Prompt
  • Figure 3: Extracted Input Specification for Part-of-Speech Prompt
  • Figure 4: Extracted Output Rules for Part-of-Speech Prompt
  • Figure 5: Tests Generated for Part-of-Speech Prompt
  • ...and 8 more figures