What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering

Federico Errica, Giuseppe Siracusano, Davide Sanvito, Roberto Bifulco

TL;DR

The paper tackles the problem of LLM fragility to prompt engineering by introducing two diagnostic metrics, sensitivity and consistency, which quantify how predictions vary with prompt variations and how stable they are across samples of the same class. Sensitivity assesses prompt-induced prediction changes without ground-truth labels, while consistency relies on distributional similarity across class-matched samples, both estimated via multiple prompt rephrasings. Through experiments on five classification benchmarks with multiple models and prompting strategies, the authors show these metrics offer information beyond accuracy and can guide prompt design and model selection, highlighting that low sensitivity and high consistency are desirable for robust production use. Limitations include reliance on classification tasks and sampling choices, with ethical considerations addressing potential misuse and the need for responsible deployment.

Abstract

Large Language Models (LLMs) have changed the way we design and interact with software systems. Their ability to process and extract information from text has drastically improved productivity in a number of routine tasks. Developers who want to include these models in their software stack, however, face a dreadful challenge: debugging LLMs' inconsistent behavior across minor variations of the prompt. We therefore introduce two metrics for classification tasks, namely sensitivity and consistency, which are complementary to task performance. Sensitivity measures changes of predictions across rephrasings of the prompt and does not require access to ground-truth labels. Consistency, instead, measures how predictions vary across rephrasings for elements of the same class. We perform an empirical comparison of these metrics on text classification tasks, using them as a guideline for understanding failure modes of the LLM. Our hope is that sensitivity and consistency will help guide prompt engineering and yield LLMs that balance robustness with performance.
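
To make the sensitivity idea concrete, here is a minimal Python sketch. It assumes sensitivity is scored as the entropy of the empirical distribution of predicted classes over prompt rephrasings. That formula is an assumption made for illustration, since this summary only states that sensitivity measures prediction changes across rephrasings and needs no ground-truth labels. The function names and the class labels are hypothetical.

```python
import math
from collections import Counter


def prediction_distribution(predictions, classes):
    """Empirical distribution of predicted classes over prompt rephrasings."""
    counts = Counter(predictions)
    total = len(predictions)
    return [counts[c] / total for c in classes]


def sensitivity(predictions, classes):
    """Entropy (in nats) of the predictions collected for ONE input sample.

    Assumption: entropy is used as the measure of prediction change.
    0.0 means every rephrasing produced the same class; larger values
    mean the prediction flips more often. No labels are needed.
    """
    probs = prediction_distribution(predictions, classes)
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)


# Hypothetical labels; each list holds one prediction per prompt rephrasing.
classes = ["ENTY", "HUM", "LOC"]
stable = ["HUM", "HUM", "HUM", "HUM"]
unstable = ["HUM", "ENTY", "HUM", "ENTY"]

print(sensitivity(stable, classes))    # no variation across rephrasings
print(sensitivity(unstable, classes))  # predictions split between two classes
```

Under this assumption, a sample whose prediction never changes scores zero, and a sample split evenly between two classes scores log 2.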

Paper Structure (24 sections, 8 equations, 6 figures, 2 tables)

This paper contains 24 sections, 8 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Example of GPT3.5 behavior when classifying a question in terms of what it is referring to. A slight change in the definition of the class "ENTY" causes a minor prompt variation that disrupts the LLM's prediction. This happens under the hood, making it very hard for a developer to debug the program. Note that the same might happen, for instance, if the ordering or naming of variables is changed (hence the quote of Section \ref{sec:introduction}).
  • Figure 2: Predicted class distributions over prompt rephrasings $p_\tau$ across three samples of the same class Person (TREC dataset, Section \ref{sec:experiments}). Merely syntactic prompt rephrasings can produce very diverse distributions. For instance, sample 2 is characterized by high sensitivity and (compared to the others) low consistency.
  • Figure 3: Top: we show the sensitivity for each sample of the dataset according to different prompting strategies. Bottom: we plot the sensitivity $S_\tau$ for each class and prompting strategy (Llama3). Recall that the prompting strategy itself might be considered another semantically equivalent rephrasing of the initial prompt $\rho_0$ (Section \ref{sec:method}).
  • Figure 4: Top: we visualize the matrix of pairwise $C_y(\bm{x},\bm{x}')$ for three different TREC classes, using Llama3 as a classifier. Bottom: we build a histogram for each of the above matrices, to show the distribution of consistency across samples of a given class.
  • Figure 5: We show the violin plot of the Llama3 consistency over samples of the same classes, arranged by prompting technique, on different datasets.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Definition 1: Sensitivity
  • Definition 2: Consistency
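
The consistency definition listed above can be illustrated with a short Python sketch. It assumes the distributional similarity between two samples of the same class is computed as one minus the total variation distance between their predicted-class distributions over rephrasings. That specific distance is an assumption, since this summary only states that consistency relies on distributional similarity across class-matched samples. All function names and labels are hypothetical.

```python
from collections import Counter


def prediction_distribution(predictions, classes):
    """Empirical distribution of predicted classes over prompt rephrasings."""
    counts = Counter(predictions)
    total = len(predictions)
    return [counts[c] / total for c in classes]


def consistency(preds_a, preds_b, classes):
    """Similarity in [0, 1] between two samples that share a true class.

    Assumption: similarity = 1 - total variation distance (TVD).
    1.0 means both samples yield identical prediction distributions.
    """
    p = prediction_distribution(preds_a, classes)
    q = prediction_distribution(preds_b, classes)
    tvd = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))
    return 1.0 - tvd


def pairwise_consistency(samples, classes):
    """Matrix of consistency values for all pairs of same-class samples."""
    return [[consistency(a, b, classes) for b in samples] for a in samples]


# Hypothetical predictions for three samples of one class,
# one prediction per prompt rephrasing.
classes = ["ENTY", "HUM", "LOC"]
samples = [
    ["HUM", "HUM", "HUM", "HUM"],
    ["HUM", "HUM", "ENTY", "LOC"],
    ["HUM", "HUM", "HUM", "HUM"],
]

matrix = pairwise_consistency(samples, classes)
for row in matrix:
    print(row)
```

A matrix like this is the kind of object visualized in Figure 4 (top). Unlike sensitivity, it groups samples by class and therefore requires ground-truth labels.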