PromptCOS: Towards Content-only System Prompt Copyright Auditing for LLMs

Yuchen Yang, Yiming Li, Hongwei Yao, Enhao Huang, Shuo Shao, Yuyi Wang, Zhibo Wang, Dacheng Tao, Zhan Qin

TL;DR

PromptCOS tackles the challenge of copyright auditing for system prompts under content-only access, where intermediate logits are unavailable. It introduces a two-phase pipeline that embeds a watermark via cyclic output signals, auxiliary tokens, and cover tokens, and verifies copyright using a sliding-window, char-level similarity on the output content. The method achieves high effectiveness and distinctiveness, preserves fidelity, remains robust against adaptive attacks, and delivers major efficiency gains over prior logits-based approaches. This work enables practical, noninvasive protection of prompt-related intellectual property in real-world LLM-based applications with broad deployment implications.

Abstract

System prompts are critical for shaping the behavior and output quality of large language model (LLM)-based applications, driving substantial investment in optimizing high-quality prompts beyond traditional handcrafted designs. However, as system prompts become valuable intellectual property, they are increasingly vulnerable to prompt theft and unauthorized use, highlighting the urgent need for effective copyright auditing, especially watermarking. Existing methods rely on verifying subtle logit distribution shifts triggered by a query. We observe that this logit-dependent verification framework is impractical in real-world content-only settings, primarily because (1) random sampling makes content-level generation unstable for verification, and (2) stronger instructions needed for content-level signals compromise prompt fidelity. To overcome these challenges, we propose PromptCOS, the first content-only system prompt copyright auditing method based on content-level output similarity. PromptCOS achieves watermark stability by designing a cyclic output signal as the conditional instruction's target. It preserves prompt fidelity by injecting a small set of auxiliary tokens to encode the watermark, leaving the main prompt untouched. Furthermore, to ensure robustness against malicious removal, we optimize cover tokens, i.e., critical tokens in the original prompt, to ensure that removing auxiliary tokens causes severe performance degradation. Experimental results show that PromptCOS achieves high effectiveness (99.3% average watermark similarity), strong distinctiveness (60.8% higher than the best baseline), high fidelity (accuracy degradation no greater than 0.6%), robustness (resilience against four potential attack categories), and high computational efficiency (up to 98.1% cost saving). Our code is available at GitHub (https://github.com/LianPing-cyber/PromptCOS).

Paper Structure

This paper contains 32 sections, 18 equations, 6 figures, 6 tables, 2 algorithms.

Figures (6)

  • Figure 1: Illustration of prompt leakage. To develop an LLM-based application, the prompt owner designs a high-quality system prompt, which requires significant computational resources and expert data. However, adversaries may steal the prompt through illegal prompt deals [promptmarket2025promptbase] or data leakage [agarwal2024promptleakage, yang2025prsa] to develop competitive applications, harming the interests of the prompt owner.
  • Figure 2: Fidelity measures the consistency of responses to standard queries, comparing outputs with and without embedded watermarks. Distinctiveness evaluates the system's ability to distinguish between benign and adversarial applications when presented with a verification query.
  • Figure 3: PromptCOS protects system prompts through two phases: (1) Watermark Embedding. PromptCOS defines optimization objectives with three key designs, including cyclic output signals (effectiveness), auxiliary tokens (fidelity), and cover tokens (robustness), and exploits an alternating optimization algorithm to optimize the watermark components: watermarked prompt, verification query, and signal mark. (2) Copyright Verification. PromptCOS audits suspicious applications by submitting the verification query, segmenting their outputs with a sliding window, and measuring similarity against the signal mark. If the maximum similarity exceeds a predefined threshold, the application is deemed to have misappropriated the prompt.
  • Figure 4: The single signal limits the position and timing of signal mark generation, making it susceptible to uncertain token sampling. If the LLM misses the correct position, it becomes difficult to successfully generate the signal mark later. In contrast, a cyclic signal allows the LLM to generate the signal mark multiple times, thereby increasing the overall probability of signal mark generation and improving the effectiveness of the watermark method.
  • Figure 5: Watermarked system prompts' performance under different attacks. The four attack strategies are represented as RED (redundancy), CON (constraint), OPT (re-optimization), and DEL (auxiliary token deletion). The 0 on the y-axis represents the baseline value exhibited by the watermarked prompt when no attacks are applied.
  • ...and 1 more figure
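
The copyright verification step described in the Figure 3 caption (slide a window over the suspicious application's output, score each segment against the signal mark, and compare the maximum score with a threshold) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the window width (taken here as the length of the signal mark), the stride, the 0.8 threshold, and the use of Python's `difflib.SequenceMatcher` as the character-level similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1].

    Assumption: SequenceMatcher's ratio stands in for whatever
    char-level metric the paper actually uses.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def max_window_similarity(output: str, signal_mark: str, stride: int = 1) -> float:
    """Return the best similarity between the signal mark and any
    window of the output (hypothetical helper)."""
    if not signal_mark:
        raise ValueError("signal_mark must be non-empty")
    width = len(signal_mark)  # assumed window width
    if len(output) <= width:
        # Output is shorter than one window: compare it as a whole.
        return char_similarity(output, signal_mark)
    best = 0.0
    for start in range(0, len(output) - width + 1, stride):
        window = output[start:start + width]
        best = max(best, char_similarity(window, signal_mark))
    return best


def is_suspected_copy(output: str, signal_mark: str, threshold: float = 0.8) -> bool:
    """Flag the application if the maximum similarity exceeds the
    threshold (0.8 is an illustrative value, not from the paper)."""
    return max_window_similarity(output, signal_mark) > threshold


if __name__ == "__main__":
    mark = "zq7-zq7-zq7"  # made-up signal mark for illustration
    print(is_suspected_copy("Sure. zq7-zq7-zq7 Hope this helps", mark))
    print(is_suspected_copy("The capital of France is Paris.", mark))
```

Taking the maximum over windows means the signal mark only has to appear once, anywhere in the output, for the check to fire, which fits the caption's rule that the application is flagged when the maximum similarity exceeds the threshold.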