Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications

Junlin Wang, Tianyi Yang, Roy Xie, Bhuwan Dhingra

TL;DR

This paper introduces the Raccoon benchmark, which comprehensively evaluates a model's susceptibility to prompt extraction attacks, and shows that models are universally susceptible to prompt theft in the absence of defenses.

Abstract

With the proliferation of LLM-integrated applications such as GPT-s, millions are deployed, offering valuable services through proprietary instruction prompts. These systems, however, are prone to prompt extraction attacks through meticulously designed queries. To help mitigate this problem, we introduce the Raccoon benchmark, which comprehensively evaluates a model's susceptibility to prompt extraction attacks. Our novel evaluation method assesses models under both defenseless and defended scenarios, employing a dual approach to evaluate the effectiveness of existing defenses and the resilience of the models. The benchmark encompasses 14 categories of prompt extraction attacks, with additional compounded attacks that closely mimic the strategies of potential attackers, alongside a diverse collection of defense templates. This array is, to our knowledge, the most extensive compilation of prompt theft attacks and defense mechanisms to date. Our findings highlight universal susceptibility to prompt theft in the absence of defenses, with OpenAI models demonstrating notable resilience when protected. This paper aims to establish a more systematic benchmark for assessing LLM robustness against prompt extraction attacks, offering insights into their causes and potential countermeasures. Raccoon resources are publicly available at https://github.com/M0gician/RaccoonBench.

Paper Structure

This paper contains 45 sections, 10 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: An example of a prompt extraction attack on an LLM-integrated application with a defense.
  • Figure 2: Model susceptibility scores under four settings: DefenselessSingular, DefenselessCompound, DefendedSingular, DefendedCompound. In the defended settings, an in-context defense safeguards the instruction prompt. A larger area means the model is more susceptible to prompt theft.
  • Figure 3: The matrices show attack success rates (ASRs) for each prompt extraction attack category and each model. Each attack category has three attack prompts, and we show the maximum ASR here. (a) shows the ASR for each singular attack. (b) shows the ASR for each compound attack. Here compound attacks are constructed by picking the five singular attack categories and combining them manually. We use the abbreviations defined in Table \ref{tab:singular_atk_descrip}.
  • Figure 4: We report $\text{ModelSusceptibility}_{max}$, $\text{ModelSusceptibility}_{avg}$, and $\text{ModelSusceptibility}_{wa}$ for singular and compound attacks in the defended setting. We separate the defenses into three groups (short, medium, and long) to show the effect of defense length and complexity. The red, blue, and green dashed lines indicate undefended results. Defenses work to an extent: the maximum, the average, and the percentage of working attacks are all lower than in the defenseless setting.
  • Figure 5: The relationship between model capability (AlpacaEval 2.0 Scores) and ASR.
  • ...and 3 more figures
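
The aggregations named in the captions of Figures 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the sample ASR values, and the rule that an attack "works" when its ASR exceeds a threshold of 0 are all assumptions made here. The paper's own definitions of $\text{ModelSusceptibility}_{max}$, $\text{ModelSusceptibility}_{avg}$, and $\text{ModelSusceptibility}_{wa}$ should be taken from its equations.

```python
from statistics import mean


def category_max_asr(asr_by_prompt):
    """Return the highest ASR among the prompts of each attack category.

    asr_by_prompt maps a category name to the ASRs (0.0 to 1.0) of its
    attack prompts, e.g. the three prompts per category in Figure 3.
    """
    return {category: max(rates) for category, rates in asr_by_prompt.items()}


def susceptibility_summary(asrs, working_threshold=0.0):
    """Summarise a list of per-attack ASRs as max, average, and share of working attacks.

    ASSUMPTION: an attack counts as "working" when its ASR is strictly above
    working_threshold. The source does not state this rule.
    """
    working = sum(1 for rate in asrs if rate > working_threshold)
    return {
        "max": max(asrs),
        "avg": mean(asrs),
        "wa": working / len(asrs),
    }


if __name__ == "__main__":
    # Hypothetical ASR values for two made-up categories, for illustration only.
    sample = {
        "category_a": [0.10, 0.50, 0.30],
        "category_b": [0.00, 0.00, 0.20],
    }
    per_category = category_max_asr(sample)
    print(per_category)
    print(susceptibility_summary(list(per_category.values())))
```

Taking the per-category maximum first, as the Figure 3 caption describes, reports each category by its strongest prompt, so a single effective prompt is enough to mark the category as a threat.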