Prompts have evil twins

Rimon Melamed; Lucas H. McCabe; Tanay Wakhare; Yejin Kim; H. Howie Huang; Enric Boix-Adsera

Abstract

We discover that many natural-language prompts can be replaced by corresponding prompts that are unintelligible to humans but that provably elicit similar behavior in language models. We call these prompts "evil twins" because they are obfuscated and uninterpretable (evil), but at the same time mimic the functionality of the original natural-language prompts (twins). Remarkably, evil twins transfer between models. We find these prompts by solving a maximum-likelihood problem which has applications of independent interest.
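
As a rough sketch of the maximum-likelihood problem mentioned above, suppose documents $d_1, \ldots, d_n$ are sampled from the model conditioned on a ground-truth prompt ${\boldsymbol p}^*$. The symbols $d_i$, $n$, and $\mathbb{P}_{\mathrm{LLM}}$ are notation assumed here for illustration and may differ from the paper's own. An evil twin ${\boldsymbol p}$ is then a candidate prompt that maximizes the likelihood of those documents:

$$
{\boldsymbol p} \in \arg\max_{{\boldsymbol p}'} \sum_{i=1}^{n} \log \mathbb{P}_{\mathrm{LLM}}(d_i \mid {\boldsymbol p}')
$$

Under the same assumed notation, the KL divergence between the two prompts can be written as

$$
d_{KL}({\boldsymbol p}^* \parallel {\boldsymbol p}) = \mathbb{E}_{d \sim \mathbb{P}_{\mathrm{LLM}}(\cdot \mid {\boldsymbol p}^*)}\left[ \log \mathbb{P}_{\mathrm{LLM}}(d \mid {\boldsymbol p}^*) - \log \mathbb{P}_{\mathrm{LLM}}(d \mid {\boldsymbol p}) \right]
$$

Since the first term inside the expectation does not depend on ${\boldsymbol p}$, maximizing the average log-likelihood of documents sampled from ${\boldsymbol p}^*$ amounts to minimizing a Monte Carlo estimate of this divergence.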

Paper Structure (30 sections, 16 equations, 10 figures, 3 tables, 2 algorithms)

Introduction
Our contributions
Functional similarity between prompts
Finding prompts with similar functionality
Investigations on optimized prompts
Related work
How models parse prompts
Prompt optimization
Preliminaries
Autoregressive language models
Probability of a document
Optimization problem
KL divergence between prompts
Optimization problem
Comparison of optimization methods
...and 15 more sections

Figures (10)

Figure 1: Five examples of ground-truth prompts ${\boldsymbol p}^*$ and corresponding "evil twins" ${\boldsymbol p}$. Each evil twin is found by solving the maximum-likelihood problem (\ref{eq:mle-def-intro}) on 100 documents generated from the ground-truth prompt. We compare the evil twins to a baseline created by asking GPT-4 to generate a prompt that could have created the 100 documents. Surprisingly, the optimized prompts, although incoherent, are more functionally similar to the ground-truth prompt (lower KL divergence) than the GPT-4 reconstruction. Details are in Section \ref{sec:methods-comparison}. Figure \ref{fig:full-kl-results} in the appendix contains a full table of results.
Figure 2: Win rate between various methods across optimizations of 100 ground-truth prompts with 100 documents each. Given two prompts to compare, we compute the KL divergence of each prompt with respect to the ground truth, and the method with the lower KL wins. In the case of ties, the win is shared by both methods. Darker shades indicate that the ROW method is better than the COLUMN method. The most effective method is GCG with warm starts. Full optimization results are shown in Appendix \ref{app:recon-examples}.
Figure 3: Transferability between model sizes. For each model size in the Pythia suite (excluding 12B), and each of 100 prompt sentences from the HellaSwag dataset \cite{zellers-etal-2019-hellaswag}, we run GCG with cold start to generate an optimized prompt based on 100 documents from the original prompt. For each optimized prompt at each model size, we compute the KL divergence for the optimized prompt at all other model sizes. The measured ratio is $\frac{d_{KL,\mathrm{dest}}({\boldsymbol p}^* \parallel {\boldsymbol p}_{\mathrm{source}})}{d_{KL,\mathrm{source}}({\boldsymbol p}^* \parallel {\boldsymbol p}_{\mathrm{source}})}$ averaged over all 100 prompts, where ${\boldsymbol p}_{\mathrm{source}}$ represents the optimized prompt from the source model, $d_{KL,\mathrm{source}}$ represents the KL divergence as measured on the source model, and $d_{KL,\mathrm{dest}}$ represents the KL divergence as measured on the destination model. Full results are shown in Table \ref{tab:transfer-results-pythia}.
Figure 4: Individual token importance in optimized and original prompts for various models. For each of the 100 prompts from the Alpaca \cite{alpaca} and OpenHermes-2.5 datasets, and for each of the first 6 positions $i \in \{1,\ldots,6\}$ of the prompt, we compute the KL divergence $d_{KL}({\boldsymbol p} \parallel r_i({\boldsymbol p}))$, where $r_i({\boldsymbol p})$ is the prompt with position $i$ replaced by the [UNK] token. Each histogram is over all positions and prompts (either the original prompts or the optimized prompts) for a given model. The optimized prompts appear to be generally more sensitive to this single-token replacement than the original prompts.
Figure 5: Hard prompt optimization results for various fluency penalties $\gamma$ with the Vicuna-7b model. We use a 100-prompt subset of Alpaca and optimize with Vicuna-7b, starting from a GPT-4 warm start. The optimization proceeds for 50 epochs, and we report the final values of the KL divergence to the ground truth and of the log-probability of the optimized prompt.
...and 5 more figures
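
The pairwise win rate described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function name `win_rate_matrix`, the method labels, and the KL values are all hypothetical, and a tie is assumed to give each method half a win (one reading of "the win is shared by both methods").

```python
from itertools import combinations


def win_rate_matrix(kl_by_method):
    """Pairwise win rates from per-prompt KL divergences to the ground truth.

    kl_by_method maps a method name to a list of KL values, one per
    ground-truth prompt, in the same prompt order for every method.
    Lower KL wins; a tie is split 0.5 / 0.5 (an assumption).
    """
    methods = list(kl_by_method)
    n_prompts = len(kl_by_method[methods[0]])
    score = {a: {b: 0.0 for b in methods if b != a} for a in methods}
    for a, b in combinations(methods, 2):
        for kl_a, kl_b in zip(kl_by_method[a], kl_by_method[b]):
            if kl_a < kl_b:
                score[a][b] += 1.0
            elif kl_b < kl_a:
                score[b][a] += 1.0
            else:
                score[a][b] += 0.5
                score[b][a] += 0.5
    # Entry [row][col] is the fraction of prompts on which row beats col.
    return {a: {b: s / n_prompts for b, s in row.items()}
            for a, row in score.items()}


# Made-up KL values for four prompts, for illustration only.
example = {
    "gcg_warm": [1.0, 2.0, 3.0, 0.5],
    "gpt4": [2.0, 2.0, 1.0, 1.0],
}
rates = win_rate_matrix(example)
print(rates["gcg_warm"]["gpt4"])  # 0.625: two wins, one tie, one loss
```

With this tie convention the two entries for any pair of methods sum to 1, so the matrix is fully determined by its upper triangle.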

Theorems & Definitions (1)

Definition 1
