On the Worst Prompt Performance of Large Language Models

Bowen Cao; Deng Cai; Zhisong Zhang; Yuexian Zou; Wai Lam

TL;DR

This paper introduces RobustAlpacaEval, a new benchmark of semantically equivalent case-level queries. It argues for using worst prompt performance to gauge the lower bound of model performance, and it illustrates the difficulty of identifying the worst prompt from both model-agnostic and model-dependent perspectives.

Abstract

The performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts, which raises significant concerns about their reliability in real-world scenarios. Existing studies often divide prompts into task-level instructions and case-level inputs and primarily focus on evaluating and improving robustness against variations in task-level instructions. However, this setup fails to fully address the diversity of real-world user queries and assumes the existence of task-specific datasets. To address these limitations, we introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries and emphasizes the importance of using the worst prompt performance to gauge the lower bound of model performance. Extensive experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance; for instance, the Llama-2-70B-chat model shows a difference of 45.48% between its worst and best performance, with its worst performance dipping as low as 9.38%. We further illustrate the difficulty in identifying the worst prompt from both model-agnostic and model-dependent perspectives, emphasizing the absence of a shortcut to characterize the worst prompt. We also attempt to enhance the worst prompt performance using existing prompt engineering and prompt consistency methods, but find that their impact is limited. These findings underscore the need to create more resilient LLMs that can maintain high performance across diverse prompts. Data and code are available at https://github.com/cbwbuaa/On-the-Worst-Prompt-Performance-of-LLMs.
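To make the notion of worst prompt performance concrete, the following minimal sketch shows one plausible way to aggregate per-prompt scores into worst, best, and average figures. It assumes that each case has one score per semantically equivalent paraphrase and that the worst (or best) figure is the mean over cases of the per-case minimum (or maximum). The function name `prompt_robustness_summary`, the dictionary layout, and the toy scores are hypothetical illustrations; they are not taken from the paper or its released code, and the paper's exact metric definitions may differ.

```python
from statistics import mean


def prompt_robustness_summary(scores):
    """Summarise per-case scores over paraphrased prompts.

    scores: dict mapping a case id to a list of scores, one per
    semantically equivalent prompt for that case (hypothetical layout).
    """
    # Worst prompt performance: take the lowest score within each case,
    # then average over cases (assumed aggregation).
    worst = mean(min(s) for s in scores.values())
    # Best prompt performance: same aggregation with the per-case maximum.
    best = mean(max(s) for s in scores.values())
    # Average performance: mean over prompts within a case, then over cases.
    average = mean(mean(s) for s in scores.values())
    return {"worst": worst, "best": best, "average": average, "gap": best - worst}


# Toy values for illustration only; these are not results from the paper.
toy_scores = {
    "case_a": [0.2, 0.6, 0.9],
    "case_b": [0.5, 0.5, 0.7],
}
summary = prompt_robustness_summary(toy_scores)
print(summary)
```

Under these assumptions, the gap between the best and worst figures plays the role of the best-versus-worst difference quoted in the abstract, while the worst figure alone serves as the lower bound on performance.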

Paper Structure (26 sections, 4 equations, 10 figures, 5 tables)

Introduction
Related Work
Prompt Consistency.
Prompt Engineering.
Benchmarking the Worst Prompt Performance
A New Benchmark: RobustAlpacaEval
Data.
Metrics.
Results
Identifying the Worst Prompts
Model-agnostic Analysis
Overlap of the worst prompts across different models.
Performance Rankings of prompts across different models.
Overlap of Sensitive Cases.
Discussion.
...and 11 more sections

Figures (10)

Figure 1: An example illustrating the gap between existing benchmarks that evaluate prompt consistency and real user queries.
Figure 2: The overlap rate of model-agnostic worst-$k$ prompts across different models. The low overlap rate indicates that universally poor prompts are rare.
Figure 3: IoU fluctuation across varying sensitive case thresholds for diverse model sets. The IoU drops below 0.2 across all models, indicating a scarcity of model-agnostic traits.
Figure 4: Distribution of Pearson correlation coefficients between model performance and prompt perplexity (left) and the prompt's Min-K% Prob (right) for Llama-family models across all cases. Absolute correlation values in the ranges (0, 0.3], (0.3, 0.6], and (0.6, 1] denote weak or no correlation, moderate correlation, and strong correlation, respectively.
Figure 5: (a) Visualization of Llama-2-7B-chat model’s hidden states using 2-dimensional PCA. The color gradient, from light to dark, represents the ranking of model performance on each case's 11 prompts, from low to high. (b) Probing Llama-2-7B-chat model’s hidden states for prompt scoring. The x-axis stands for training steps. The y-axis represents the accuracy of the model's predictions, quantified as the proportion of correctly judged prompt pairs out of all test pairs.
...and 5 more figures
