PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation

Eliya Habba, Noam Dahan, Gili Lior, Gabriel Stanovsky

Abstract

Evaluating LLMs with a single prompt has proven unreliable, with small changes leading to significant performance differences. However, generating the prompt variations needed for a more robust multi-prompt evaluation is challenging, limiting its adoption in practice. To address this, we introduce PromptSuite, a framework that enables the automatic generation of various prompts. PromptSuite is flexible, working out of the box on a wide range of tasks and benchmarks. It follows a modular prompt design, allowing controlled perturbations to each component, and is extensible, supporting the addition of new components and perturbation types. Through a series of case studies, we show that PromptSuite provides meaningful variations to support strong evaluation practices. All resources, including the Python API, source code, user-friendly web interface, and demonstration video, are available at: https://eliyahabba.github.io/PromptSuite/.

Paper Structure

This paper contains 27 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: PromptSuite framework: configure a modular prompt, and apply component-wise perturbations. This modularity enables PromptSuite to generalize across tasks and adapt to diverse data.
  • Figure 2: PromptSuite's web UI. Left-to-right: uploading a dataset; configuring the template and choosing perturbations; and generating a multi-prompt dataset. The presented example demonstrates a single prompt variation, with changes to the prompt format and instance content.
  • Figure 3: Multi-prompt evaluation results using PromptSuite. The boxplots illustrate variance across different prompt perturbations, revealing models' sensitivity to prompt variations and underscoring the utility of PromptSuite for deriving robust and meaningful evaluations of LLM capabilities.
  • Figure 4: Analysis of how perturbations to individual prompt components affect model sensitivity on GPQA-Diamond. Each boxplot represents an experiment in which a single prompt component was varied while all others remained fixed.
  • Figure 5: Analysis of how perturbations to individual prompt components affect model sensitivity on SQuAD and GSM8K. Each boxplot represents an experiment in which a single prompt component was varied while all others remained fixed.
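
The experimental design described in the captions of Figures 4 and 5 (vary one prompt component, hold the rest fixed) can be illustrated with a minimal Python sketch. This is not the PromptSuite API: the component names (`instruction`, `demonstration`, `question`), the toy perturbations, and the helper functions `vary_single_component` and `render` are all hypothetical, chosen only to show the idea of component-wise perturbation.

```python
from typing import Callable, Dict, List

# NOTE: everything below is an illustrative assumption, not PromptSuite code.
Prompt = Dict[str, str]
Perturbation = Callable[[str], str]

# A modular prompt: each component is stored and perturbed separately.
BASE_PROMPT: Prompt = {
    "instruction": "Answer the following question.",
    "demonstration": "Q: What is 2 + 2?\nA: 4",
    "question": "Q: What is 3 + 5?\nA:",
}

# Toy perturbations per component (hypothetical examples of surface changes).
PERTURBATIONS: Dict[str, List[Perturbation]] = {
    "instruction": [str.upper, lambda s: s.rstrip(".") + ":"],
    "demonstration": [
        lambda s: s.replace("Q:", "Question:").replace("A:", "Answer:")
    ],
    "question": [lambda s: s.replace("Q: ", "Q:  ")],
}


def vary_single_component(
    base: Prompt, component: str, perturbations: List[Perturbation]
) -> List[Prompt]:
    """Return one variant per perturbation, changing only `component`."""
    variants = []
    for perturb in perturbations:
        variant = dict(base)  # shallow copy keeps the base prompt untouched
        variant[component] = perturb(base[component])
        variants.append(variant)
    return variants


def render(prompt: Prompt) -> str:
    """Join the components into the final prompt string, in a fixed order."""
    order = ("instruction", "demonstration", "question")
    return "\n\n".join(prompt[name] for name in order)


# One "experiment" per component: only that component differs from the base.
experiments = {
    component: vary_single_component(BASE_PROMPT, component, perturbs)
    for component, perturbs in PERTURBATIONS.items()
}

if __name__ == "__main__":
    for component, variants in experiments.items():
        print(f"=== varying: {component} ({len(variants)} variants) ===")
        for variant in variants:
            print(render(variant))
            print("---")
```

Each entry of `experiments` would correspond to one boxplot in such an analysis: a model is scored on every variant in the group, and the spread of scores indicates its sensitivity to that component alone.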