How are Prompts Different in Terms of Sensitivity?

Sheng Lu; Hendrik Schuff; Iryna Gurevych

TL;DR

This work introduces sensitivity-aware decoding, which incorporates sensitivity estimation as a penalty term in standard greedy decoding, and shows that this approach is particularly helpful when information in the input is scarce.

Abstract

In-context learning (ICL) has become one of the most popular learning paradigms. While there is a growing body of literature focusing on prompt engineering, there is a lack of systematic analysis comparing the effects of prompts across different models and tasks. To address this gap, we present a comprehensive prompt analysis based on the sensitivity of a function. Our analysis reveals that sensitivity is an unsupervised proxy for model performance, as it exhibits a strong negative correlation with accuracy. We use gradient-based saliency scores to empirically demonstrate how different prompts affect the relevance of input tokens to the output, resulting in different levels of sensitivity. Furthermore, we introduce sensitivity-aware decoding, which incorporates sensitivity estimation as a penalty term in standard greedy decoding. We show that this approach is particularly helpful when information in the input is scarce. Our work provides a fresh perspective on the analysis of prompts and contributes to a better understanding of the mechanism of ICL.
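
As a rough illustration of the decoding idea described in the abstract, the sketch below subtracts a sensitivity penalty from each candidate token's log-probability before taking the greedy argmax. The function name `sensitivity_aware_step`, the penalty weight `lam`, and the per-token sensitivity values are assumptions made for illustration only; the abstract does not specify the exact form of the penalty, so this should not be read as the authors' implementation.

```python
import math


def sensitivity_aware_step(log_probs, sensitivities, lam=1.0):
    """Pick the next token greedily after penalizing sensitive candidates.

    log_probs:     dict mapping token -> log-probability (hypothetical model output)
    sensitivities: dict mapping token -> sensitivity estimate in [0, 1] (hypothetical)
    lam:           penalty weight (hypothetical hyperparameter)
    """
    # Assumed scoring rule: log-probability minus a weighted sensitivity penalty.
    scores = {
        tok: lp - lam * sensitivities.get(tok, 0.0)
        for tok, lp in log_probs.items()
    }
    return max(scores, key=scores.get)


# Toy example with made-up numbers for a binary acceptability task.
log_probs = {"acceptable": math.log(0.55), "unacceptable": math.log(0.45)}
sensitivities = {"acceptable": 0.8, "unacceptable": 0.1}

# Plain greedy decoding ignores sensitivity entirely.
greedy_choice = max(log_probs, key=log_probs.get)
# The penalized variant can flip the choice when the top token is highly sensitive.
penalized_choice = sensitivity_aware_step(log_probs, sensitivities, lam=1.0)
```

With `lam=0.0` the function reduces to ordinary greedy decoding, which makes the penalty weight a convenient knob for comparing the two strategies.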

Paper Structure (23 sections, 5 equations, 13 figures, 15 tables)

Introduction
Background
In-context learning
Prompt engineering
Prompt analysis
Sensitivity
Experiment settings
Results
Instruction, knowledge, chain-of-thought
What happened to Flan-T5 with zero?
The effect of decoding strategies
Open-ended generation
Gradient-based saliency scores
Sensitivity-aware decoding
Conclusion
...and 8 more sections

Figures (13)

Figure 1: (a) We generate synthetic data for testing instances using the framework of Hahn et al. (2021). (b) We perform inference multiple times using the original and synthetic data, and calculate sensitivity based on the predictions.
Figure 2: The average accuracy and sensitivity of each model using various prompts across different datasets. * indicates prompts that are not tested on all datasets.
Figure 3: The accuracy and sensitivity of different models using base_a, base_b, CoT_base_a, and CoT.
Figure 4: The accuracy and sensitivity of predictions obtained using greedy decoding and Top-k sampling across different models.
Figure 5: Saliency scores over tokens of CoLA instances with base_b obtained using GPT-6B-JT.
...and 8 more figures
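
The procedure in Figure 1 (b) can be loosely sketched as follows. The function `estimate_sensitivity` and its flip-rate definition are simplifying assumptions: here sensitivity is taken to be the fraction of synthetic (perturbed) inputs whose prediction differs from the prediction on the original input. The paper's actual measure follows the framework cited in Figure 1 and may differ in its details.

```python
def estimate_sensitivity(original_prediction, perturbed_predictions):
    """Fraction of perturbed inputs whose prediction differs from the original.

    original_prediction:   label predicted for the unmodified input
    perturbed_predictions: labels predicted for the synthetic variants
    (Both arguments are hypothetical stand-ins for real model outputs.)
    """
    if not perturbed_predictions:
        raise ValueError("need at least one perturbed prediction")
    flips = sum(p != original_prediction for p in perturbed_predictions)
    return flips / len(perturbed_predictions)


# Toy example: two of the four synthetic variants flip the prediction.
example = estimate_sensitivity(
    "acceptable",
    ["acceptable", "unacceptable", "acceptable", "unacceptable"],
)
```

Under this assumed definition, a value of 0 means the prediction never changed under perturbation and a value of 1 means it changed every time.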
