PLeak: Prompt Leaking Attacks against Large Language Model Applications

Bo Hui, Haolin Yuan, Neil Gong, Philippe Burlina, Yinzhi Cao

TL;DR

PLeak introduces an automated, closed-box prompt leaking attack against LLM applications by optimizing adversarial queries on shadow prompts and a shadow LLM. The method uses incremental search to reveal system-prompt tokens and an adversarial transformation to bypass defenses, followed by post-processing to reconstruct the target prompt. Across offline benchmarks and real-world Poe deployments, PLeak consistently outperforms manually curated and jailbreaking-inspired baselines in both exact and semantic leakage metrics. The work highlights significant IP risks for LLM apps and discusses defenses, responsible disclosure, and future directions for robust prompt privacy.

Abstract

Large Language Models (LLMs) enable a new ecosystem of downstream applications, called LLM applications, that perform different natural language processing tasks. The functionality and performance of an LLM application depend heavily on its system prompt, which instructs the backend LLM on what task to perform. An LLM application developer therefore often keeps the system prompt confidential to protect its intellectual property. As a result, a natural attack, called prompt leaking, is to steal the system prompt from an LLM application, which compromises the developer's intellectual property. Existing prompt leaking attacks primarily rely on manually crafted queries and thus achieve limited effectiveness. In this paper, we design a novel, closed-box prompt leaking attack framework, called PLeak, that optimizes an adversarial query such that when the attacker sends it to a target LLM application, the response reveals the application's system prompt. We formulate finding such an adversarial query as an optimization problem and solve it approximately with a gradient-based method. Our key idea is to break down the optimization goal by optimizing adversarial queries for system prompts incrementally, i.e., starting from the first few tokens of each system prompt and extending step by step to its entire length. We evaluate PLeak both in offline settings and on real-world LLM applications, e.g., those on Poe, a popular platform hosting such applications. Our results show that PLeak can effectively leak system prompts and significantly outperforms not only baselines that manually curate queries but also baselines with optimized queries that are modified and adapted from existing jailbreaking attacks. We responsibly reported the issues to Poe and are still waiting for their response. Our implementation is available at this repository: https://github.com/BHui97/PLeak.
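
The incremental idea in the abstract can be illustrated with a minimal Python sketch. It only shows the schedule: the optimization target grows from the first few tokens of each shadow system prompt to its full length, and the adversarial query from one stage seeds the next. The names `prefix_schedule`, `optimize_incrementally`, and `refine`, as well as the `start` and `step` values, are assumptions made for illustration and are not taken from the paper or its repository. `refine` stands in for whatever gradient-based update is actually used.

```python
from typing import Callable, Iterator, List, Sequence

Tokens = List[str]


def prefix_schedule(
    shadow_prompts: Sequence[Tokens], start: int, step: int
) -> Iterator[List[Tokens]]:
    """Yield the shadow prompts truncated to a growing prefix length.

    Assumes a non-empty collection of prompts and start, step >= 1.
    """
    longest = max(len(p) for p in shadow_prompts)
    length = start
    while True:
        # Each stage targets only the first `length` tokens of every prompt.
        yield [p[:length] for p in shadow_prompts]
        if length >= longest:
            break
        length = min(length + step, longest)


def optimize_incrementally(
    aq: Tokens,
    shadow_prompts: Sequence[Tokens],
    refine: Callable[[Tokens, List[Tokens]], Tokens],
    start: int = 2,
    step: int = 2,
) -> Tokens:
    """Run a hypothetical `refine` step once per stage of the schedule."""
    for targets in prefix_schedule(shadow_prompts, start, step):
        # The query optimized for shorter prefixes seeds the next stage.
        aq = refine(aq, targets)
    return aq
```

Passing the refinement step as a callback keeps the schedule independent of the optimizer, so the same loop works whether `refine` is a gradient-based search or a simple placeholder.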

Paper Structure

This paper contains 20 sections, 9 equations, 4 figures, 14 tables, 3 algorithms.

Figures (4)

  • Figure 1: Overview of the PLeak pipeline, which contains two phases: Phase 1, Offline Adversarial Query (AQ) Optimization, and Phase 2, Target System Prompt Reconstruction. Phase 1 has seven steps: (a) PLeak initializes $n$ AQs and concatenates each AQ with each shadow system prompt in $D_s$, (b) PLeak queries the shadow LLM with the concatenated shadow system prompt + AQ, (c) PLeak computes the loss between the responses of the shadow LLM and the shadow system prompts, (d) PLeak updates the AQs based on the loss, (e) PLeak repeats the previous steps until the loss stops decreasing, which yields the final AQs, (f) PLeak transforms each AQ using an adversarial transformation, and (g) PLeak provides the transformed AQs to Phase 2. Phase 2 then reconstructs the target system prompt from the target LLM application's responses to the transformed AQs.
  • Figure 2: [RQ1] EED ($\downarrow$) and SS ($\uparrow$) scores of Perez et al. [perez2022ignore], Zhang et al. [zhang2023prompts], a modified GCG [zou2023universal] for prompt leaking (called GCG-leak), a modified AutoDAN [liu2023autodan] (called AutoDAN-leak), and PLeak against LLM applications with target system prompts from the Tomatoes dataset.
  • Figure 3: [RQ3-1] Evaluation of PLeak with different shadow dataset sizes and different AQ lengths on four metrics.
  • Figure 4: Evaluation of PLeak under different numbers of exemplars and different steps on four metrics.
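
Steps (a) through (e) of Phase 1 in Figure 1 can be sketched as a small optimization loop. This is a toy under loudly stated assumptions, not the authors' implementation: it optimizes a single AQ rather than $n$, replaces the gradient-based update with random token substitution, uses a token-mismatch loss chosen only for illustration, and queries a hypothetical `toy_shadow_llm` that leaks two prompt tokens per `"repeat"` token in the AQ. The adversarial transformation in step (f) and Phase 2 are omitted.

```python
import random
from typing import Callable, List, Sequence, Tuple

Tokens = List[str]
ShadowLLM = Callable[[Tokens, Tokens], Tokens]  # (system prompt, AQ) -> response


def mismatch_loss(response: Tokens, target: Tokens) -> float:
    """Fraction of target positions that the response fails to reproduce."""
    hits = sum(1 for r, t in zip(response, target) if r == t)
    return 1.0 - hits / len(target)


def mean_loss(aq: Tokens, shadow_prompts: Sequence[Tokens], shadow_llm: ShadowLLM) -> float:
    # Steps (a)-(c): pair the AQ with each shadow prompt, query, and score.
    losses = [mismatch_loss(shadow_llm(p, aq), p) for p in shadow_prompts]
    return sum(losses) / len(losses)


def optimize_aq(
    aq: Tokens,
    shadow_prompts: Sequence[Tokens],
    shadow_llm: ShadowLLM,
    vocab: Sequence[str],
    patience: int = 50,
    seed: int = 0,
) -> Tuple[Tokens, float]:
    """Random-substitution search; a stand-in for the gradient-based update."""
    rng = random.Random(seed)
    best = list(aq)
    best_loss = mean_loss(best, shadow_prompts, shadow_llm)
    stale = 0
    # Step (e): stop once the loss has not decreased for `patience` tries.
    while stale < patience and best_loss > 0.0:
        candidate = list(best)
        # Step (d): propose an updated AQ by swapping one token.
        candidate[rng.randrange(len(candidate))] = rng.choice(vocab)
        loss = mean_loss(candidate, shadow_prompts, shadow_llm)
        if loss < best_loss:
            best, best_loss, stale = candidate, loss, 0
        else:
            stale += 1
    return best, best_loss


def toy_shadow_llm(system_prompt: Tokens, aq: Tokens) -> Tokens:
    # Hypothetical model: each "repeat" token leaks two more prompt tokens.
    return system_prompt[: 2 * aq.count("repeat")]


shadow_prompts = [
    ["you", "are", "a", "helpful", "translator"],
    ["rate", "this", "movie", "review"],
]
initial_aq = ["x", "x", "x"]
final_aq, final_loss = optimize_aq(
    initial_aq, shadow_prompts, toy_shadow_llm, vocab=["x", "repeat", "please"]
)
```

Because a candidate is accepted only when it lowers the loss, the returned loss never exceeds the starting loss. A real attack would query an actual shadow LLM and use gradient information to pick substitutions instead of sampling them at random.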