PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines

Reya Vir, Shreya Shankar, Harrison Chase, Will Fu-Hinthorn, Aditya Parameswaran

TL;DR

This paper introduces PROMPTEVALS, a dataset of 2,087 production-oriented LLM prompts and 12,623 ground-truth assertion criteria, built to study the automatic generation of guardrails for LLM pipelines. The work establishes a benchmark with train/validation/test splits and compares GPT-4o against open-source models (Mistral-7b, Llama-3-8b), finding that the fine-tuned open-source models outperform GPT-4o on Semantic F1 while also offering lower latency. The ground-truth criteria are constructed in three steps (initial generation, manual augmentation, refinement), and generated assertions are scored against them with Semantic F1. By releasing both the dataset and the fine-tuned models, the authors aim to support research on reliable, cost-effective prompt engineering and LLM reliability across domains.

Abstract

Large language models (LLMs) are increasingly deployed in specialized production data processing pipelines across diverse domains -- such as finance, marketing, and e-commerce. However, when running them in production across many inputs, they often fail to follow instructions or meet developer expectations. To improve reliability in these applications, creating assertions or guardrails for LLM outputs to run alongside the pipelines is essential. Yet, determining the right set of assertions that capture developer requirements for a task is challenging. In this paper, we introduce PROMPTEVALS, a dataset of 2087 LLM pipeline prompts with 12623 corresponding assertion criteria, sourced from developers using our open-source LLM pipeline tools. This dataset is 5x larger than previous collections. Using a hold-out test split of PROMPTEVALS as a benchmark, we evaluated closed- and open-source models in generating relevant assertions. Notably, our fine-tuned Mistral and Llama 3 models outperform GPT-4o by 20.93% on average, offering both reduced latency and improved performance. We believe our dataset can spur further research in LLM reliability, alignment, and prompt engineering.
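To make the notion of an assertion concrete, here is a minimal Python sketch of guardrails running alongside a pipeline. It is an illustration only, not the paper's implementation or dataset format: the product-listing task, the three criteria, and the names `Assertion`, `parse_object`, and `failed_criteria` are all hypothetical.

```python
import json
from typing import Callable, NamedTuple, Optional


class Assertion(NamedTuple):
    """A natural-language criterion paired with a programmatic check (hypothetical structure)."""
    criterion: str
    check: Callable[[str], bool]


def parse_object(output: str) -> Optional[dict]:
    """Return the parsed JSON object, or None if the output is not a JSON object."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def has_required_keys(output: str) -> bool:
    data = parse_object(output)
    return data is not None and {"title", "description"} <= data.keys()


def description_within_limit(output: str, max_words: int = 50) -> bool:
    data = parse_object(output)
    if data is None or not isinstance(data.get("description"), str):
        return False
    return len(data["description"].split()) <= max_words


# Illustrative criteria for an assumed e-commerce listing prompt.
ASSERTIONS = [
    Assertion("Output must be a valid JSON object",
              lambda o: parse_object(o) is not None),
    Assertion("Output must include 'title' and 'description' keys",
              has_required_keys),
    Assertion("Description must be at most 50 words",
              description_within_limit),
]


def failed_criteria(output: str) -> list[str]:
    """Run every assertion on one LLM output and return the criteria it violates."""
    return [a.criterion for a in ASSERTIONS if not a.check(output)]
```

In a pipeline, `failed_criteria` would be called on each model response, with a non-empty result triggering a retry or a logged failure. The hard part the paper targets is not writing such checks but deciding which criteria a given prompt actually implies.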

Paper Structure

This paper contains 30 sections, 3 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Examples of criteria pairs and their semantic similarity scores. High-scoring pairs typically represent constraints that are explicitly stated or logically derived from the prompt, while low-scoring pairs often include vague, generic, or difficult-to-measure constraints.
  • Figure 2: Distribution of Domains and Subdomains of Tasks Represented in PromptEvals.
  • Figure 3: Constraint Type Co-Occurrence Matrix
  • Figure 4: Distribution of Ground Truth Criteria by Type