POSIX: A Prompt Sensitivity Index For Large Language Models

Anwoy Chatterjee, H S V N S Kowndinya Renduchintala, Sumit Bhatia, Tanmoy Chakraborty

TL;DR

This work defines POSIX, a Prompt Sensitivity Index that quantifies how LLM outputs shift under intent-preserving prompt variations by comparing length-normalized log-likelihoods of responses. The index is shown to track four factors (response diversity, response distribution entropy, semantic coherence, and variance in confidence) across several variation types (spelling errors, prompt templates, paraphrases) and tasks. Experiments on MMLU and Alpaca with eight open-source models show that neither increasing model size nor instruction tuning guarantees reduced prompt sensitivity, whereas adding few-shot exemplars, even just one, almost always lowers POSIX. Sensitivity is task-dependent: template changes matter most for MCQs, while paraphrases matter most for open-ended generation. The authors position POSIX as a practical tool for holistic LLM evaluation and prompting guidance, and provide open-source code to reproduce and extend their analysis.

Abstract

Despite their remarkable capabilities, Large Language Models (LLMs) are found to be surprisingly sensitive to minor variations in prompts, such as spelling errors, altered wording, or a changed prompt template, and often generate significantly divergent outputs in response. However, when assessing the quality of an LLM, the focus often tends to be solely on its performance on downstream tasks, while little to no attention is paid to prompt sensitivity. To fill this gap, we propose POSIX, a novel PrOmpt Sensitivity IndeX, as a reliable measure of prompt sensitivity, thereby offering a more comprehensive evaluation of LLM performance. The key idea behind POSIX is to capture the relative change in log-likelihood of a given response upon replacing the corresponding prompt with a different intent-preserving prompt. We provide thorough empirical evidence demonstrating the efficacy of POSIX in capturing prompt sensitivity and subsequently use it to measure and compare the prompt sensitivity of various open-source LLMs. We find that merely increasing the parameter count or applying instruction tuning does not necessarily reduce prompt sensitivity, whereas adding few-shot exemplars, even just one, almost always leads to a significant decrease in prompt sensitivity. We also find that alterations to the prompt template lead to the highest sensitivity in MCQ-type tasks, whereas paraphrasing results in the highest sensitivity in open-ended generation tasks. The code for reproducing our results is open-sourced at https://github.com/kowndinya-renduchintala/POSIX.
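
The key idea stated in the abstract can be illustrated with a small sketch. It is not taken from the authors' repository, and the exact normalization is an assumption, since this summary does not reproduce the paper's formula. The function name `sensitivity_index` and the input layout are hypothetical: `logp[i][j]` is assumed to hold the log-likelihood the model assigns to response j (generated from prompt variant j) when conditioned on prompt variant i, and `lengths[j]` is assumed to be the token count of response j. For every ordered pair of distinct variants, the sketch takes the absolute change in log-likelihood, divides it by the response length, and averages the results.

```python
def sensitivity_index(logp, lengths):
    """Average length-normalized |change in log-likelihood| over prompt pairs.

    Assumed layout (hypothetical, for illustration only):
      logp[i][j]  = log P(y_j | x_i), response j scored under prompt variant i
      lengths[j]  = number of tokens in response y_j
    """
    n = len(logp)
    if n < 2:
        raise ValueError("need at least two intent-preserving prompt variants")

    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a prompt is not compared with itself
            # Change in log-likelihood of y_j when x_j is swapped for x_i,
            # normalized by the length of y_j.
            total += abs(logp[i][j] - logp[j][j]) / lengths[j]

    # Average over all ordered pairs of distinct variants.
    return total / (n * (n - 1))


# Toy example with two prompt variants and made-up log-likelihoods.
toy_logp = [[-2.0, -6.0],
            [-4.0, -3.0]]
toy_lengths = [2, 3]
toy_score = sensitivity_index(toy_logp, toy_lengths)
```

Under these assumptions, a model that assigns each response the same log-likelihood regardless of which variant is used as the prompt scores 0, and larger values indicate greater sensitivity.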

Paper Structure (26 sections, 2 equations, 9 figures, 9 tables)

This paper contains 26 sections, 2 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Correlation plots of $\psi$ with each of the four factors described in Section \ref{sec:what_does_posix_capture} in the case of MMLU: (a) Response Diversity; (b) Response Distribution Entropy; (c) Semantic Coherence; (d) Variance in Confidence.
  • Figure 2: Box plots depicting the distribution of $\psi_{\mathcal{M}, \mathbf{X}}$ for different instances of $\mathcal{M}$. The first plot corresponds to $\mathbf{X}$'s from the MMLU dataset (MCQs) and the second plot corresponds to $\mathbf{X}$'s from the Alpaca dataset (open-ended generation).
  • Figure 3: Box plots depicting the distribution of $\psi_{\mathcal{M}, \mathbf{X}}$ for two differently sized OLMo models (1B and 7B).
  • Figure 4: Box plots depicting the distribution of $\psi_{\mathcal{M}, \mathbf{X}}$ for two differently sized Llama-2 models (7B and 13B).
  • Figure 5: Box plots depicting the distribution of $\psi_{\mathcal{M}, \mathbf{X}}$ for various 7B models on the Big Bench Hard (BBH) dataset.
  • ...and 4 more figures

Theorems & Definitions (4)

  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Definition 3.4