ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs

Jingming Zhuo, Songyang Zhang, Xinyu Fang, Haodong Duan, Dahua Lin, Kai Chen

TL;DR

This work introduces ProSA, a framework designed to evaluate and comprehend prompt sensitivity in LLMs, which incorporates a novel sensitivity metric, PromptSensiScore, and leverages decoding confidence to elucidate underlying mechanisms.

Abstract

Large language models (LLMs) have demonstrated impressive capabilities across various tasks, but their performance is highly sensitive to the prompts utilized. This variability poses challenges for accurate assessment and user satisfaction. Current research frequently overlooks instance-level prompt variations and their implications on subjective evaluations. To address these shortcomings, we introduce ProSA, a framework designed to evaluate and comprehend prompt sensitivity in LLMs. ProSA incorporates a novel sensitivity metric, PromptSensiScore, and leverages decoding confidence to elucidate underlying mechanisms. Our extensive study, spanning multiple tasks, uncovers that prompt sensitivity fluctuates across datasets and models, with larger models exhibiting enhanced robustness. We observe that few-shot examples can alleviate this sensitivity issue, and subjective evaluations are also susceptible to prompt sensitivities, particularly in complex, reasoning-oriented tasks. Furthermore, our findings indicate that higher model confidence correlates with increased prompt robustness. We believe this work will serve as a helpful tool in studying prompt sensitivity of LLMs. The project is released at: https://github.com/open-compass/ProSA .

Paper Structure

This paper contains 37 sections, 6 equations, 31 figures, 4 tables.

Figures (31)

  • Figure 1: A Showcase of the Four Prompt Templates on MATH. These four prompt templates represent four different styles of constructing prompts, serving as an example of the diversity in human prompt expression.
  • Figure 2: A Comparison of Approaches to Evaluating LLMs' Prompt Sensitivity. ✓ and ✗ indicate whether the LLM's response is correct or incorrect. In this example, the LLMs appear robust under dataset-level evaluation (calculated from the variance across different templates), but this overlooks their sensitivity to different templates within the same instance.
  • Figure 3: Main Results of Prompt Sensitivity. The scatter plot shows the average performance score over 12 prompts and the PSS on different datasets.
  • Figure 4: Prompt Sensitivity vs. Model Size. The comparative charts display the relationship between the model's parameter count and prompt sensitivity. $\overline{PSS}$ refers to the average PSS over four datasets.
  • Figure 5: Impact of Few-shot on the Performance and Sensitivity. Experiments are conducted on the CommonsenseQA and ARC-Challenge datasets using five few-shot settings and four models from the Qwen series. The blue line represents the changes in the scores of LLMs (using the left scale). The orange line represents the changes in the PSS of LLMs (using the right scale).
  • ...and 26 more figures
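
The contrast drawn in the Figure 2 caption, between dataset-level and instance-level views of sensitivity, can be illustrated with a small sketch. The correctness matrix below is made up, and the two functions are hypothetical illustrations: `instance_level_disagreement` is one plausible instance-level measure (mean pairwise disagreement across templates), not necessarily the PromptSensiScore definition, which this summary does not state.

```python
from itertools import combinations
from statistics import mean, pvariance

# Hypothetical correctness matrix (illustrative values only):
# rows are instances, columns are prompt templates; 1 = correct, 0 = incorrect.
results = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]

def dataset_level_variance(matrix):
    """Variance of per-template accuracy, computed across templates."""
    n_templates = len(matrix[0])
    accuracies = [mean(row[t] for row in matrix) for t in range(n_templates)]
    return pvariance(accuracies)

def instance_level_disagreement(matrix):
    """Average over instances of the mean pairwise |a - b| across templates.

    Assumed measure for illustration; not taken from the paper.
    """
    per_instance = [
        mean(abs(a - b) for a, b in combinations(row, 2)) for row in matrix
    ]
    return mean(per_instance)

print(dataset_level_variance(results))       # every template scores 0.5 overall
print(instance_level_disagreement(results))  # yet answers flip within each instance
```

In this toy matrix every template reaches the same overall accuracy, so the dataset-level variance is zero, while within each instance the answer changes with the template, which only the instance-level measure picks up.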