Has My System Prompt Been Used? Large Language Model Prompt Membership Inference

Roman Levin, Valeriia Cherepanova, Abhimanyu Hans, Avi Schwarzschild, Tom Goldstein

TL;DR

Prompt Detective provides a training-free statistical framework for verified prompt membership inference by comparing the output distribution of a target LLM with that of a reference model prompted with a known system prompt. Using a permutation test on $p$-values derived from cosine similarities of $n$-fold BERT embeddings of $k$ responses per task, the method detects differences between prompts across diverse models, including hard prompts and black-box settings. The results show that even minor system-prompt changes yield distinguishable response distributions, enabling statistically significant verification of prompt usage with practical query budgets. This work offers a concrete tool for protecting proprietary prompts and informs design considerations for prompt privacy in real-world LLM deployments.

Abstract

Prompt engineering has emerged as a powerful technique for optimizing large language models (LLMs) for specific applications, enabling faster prototyping and improved performance, and prompting growing community interest in protecting proprietary system prompts. In this work, we explore a novel perspective on prompt privacy through the lens of membership inference. We develop Prompt Detective, a statistical method to reliably determine whether a given system prompt was used by a third-party language model. Our approach relies on a statistical test comparing the distributions of two groups of model outputs corresponding to different system prompts. Through extensive experiments with a variety of language models, we demonstrate the effectiveness of Prompt Detective for prompt membership inference. Our work reveals that even minor changes in system prompts manifest in distinct response distributions, enabling us to verify prompt usage with statistical significance.
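The statistical test described above can be illustrated with a minimal sketch. It assumes that each response has already been mapped to a fixed-size embedding vector, and that the test statistic is the cosine similarity between the mean embeddings of the two groups; both choices are one plausible reading of the summary, not a confirmed reproduction of the authors' implementation. The synthetic Gaussian vectors stand in for BERT embeddings, and the names `permutation_test`, `emb_ref`, `emb_same`, and `emb_diff` are hypothetical.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def permutation_test(emb_a, emb_b, n_permutations=1000, seed=0):
    """Approximate p-value for H0: both groups come from the same distribution.

    Statistic (an assumption of this sketch): cosine similarity between the
    mean embeddings of the two groups. Low similarity is evidence against H0.
    """
    rng = np.random.default_rng(seed)
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    n_a = len(emb_a)
    pooled = np.vstack([emb_a, emb_b])
    observed = cosine(emb_a.mean(axis=0), emb_b.mean(axis=0))

    at_most_observed = 0
    for _ in range(n_permutations):
        # Shuffle group labels and recompute the statistic.
        perm = rng.permutation(len(pooled))
        stat = cosine(pooled[perm[:n_a]].mean(axis=0),
                      pooled[perm[n_a:]].mean(axis=0))
        if stat <= observed:
            at_most_observed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (at_most_observed + 1) / (n_permutations + 1)


# Synthetic stand-ins for response embeddings (not real BERT outputs).
rng = np.random.default_rng(42)
dim, n = 16, 50
mean_ref = np.ones(dim)
mean_other = np.ones(dim)
mean_other[: dim // 2] = -1.0  # a clearly different "system prompt"

emb_ref = rng.normal(loc=mean_ref, scale=1.0, size=(n, dim))
emb_same = rng.normal(loc=mean_ref, scale=1.0, size=(n, dim))
emb_diff = rng.normal(loc=mean_other, scale=1.0, size=(n, dim))

p_same = permutation_test(emb_ref, emb_same)
p_diff = permutation_test(emb_ref, emb_diff)
print(f"same prompt:      p = {p_same:.3f}")
print(f"different prompt: p = {p_diff:.3f}")
```

A small $p$-value suggests the two groups of responses were produced under different system prompts; a large one means the test found no evidence of a difference. In a real setting the synthetic arrays would be replaced by embeddings of actual model responses to the same task prompts.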

Paper Structure

This paper contains 31 sections, 7 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Prompt Detective verifies whether a third-party chatbot uses a given proprietary system prompt by querying the system and comparing the distribution of its outputs with outputs obtained using the proprietary system prompt.
  • Figure 2: Hard Examples illustrate varying degrees of similarity between the original prompts and their rephrased versions. Similarity Level 1 is highly similar, while Level 5 is completely different.
  • Figure 3: Average $p$-value computed for different numbers of task queries. Left: Awesome-ChatGPT-Prompts. Right: Anthropic Library. Increasing the number of generations decreases the $p$-value in negative cases, while the average $p$-value for positive cases remains close to 0.5.
  • Figure 4: Effect of the number of task prompts, generations, and tokens on the performance of Prompt Detective. Average $p$-value for the GPT-3.5 model on system prompts of Similarity Level 1. The left panel shows the average $p$-value vs. the number of generations used in Prompt Detective. The blue line represents results with 10 task prompts and 5–50 generations (512 tokens long) per prompt. The red line represents results with 1–10 task prompts, each with 50 generations (512 tokens long). The right panel plots $p^n_{avg}$ against the total number of tokens generated, with the green line showing results using 10 task prompts and 50 shorter generations (16–512 tokens long).
  • Figure 5: UMAP projection of the language model's generations across 5 system prompts of varying similarity for one task prompt. Generations from different, although conceptually similar, system prompts cluster together.
  • ...and 1 more figures