SWAP: Towards Copyright Auditing of Soft Prompts via Sequential Watermarking

Wenyuan Yang, Yichen Sun, Changzheng Chen, Zhixuan Chu, Jiaheng Zhang, Yiming Li, Dacheng Tao

TL;DR

The paper tackles copyright protection for soft prompts used with CLIP by arguing that traditional non-intrusive auditing risks false positives and intrusive backdoor watermarks struggle due to the small prompt space. It introduces SWAP, a sequential watermarking scheme that embeds the watermark in a higher-complexity probability-ordering space defined over defender-specified verification classes, preserving model utility. A hypothesis-test-based ownership verification protocol with theoretical success conditions is proposed, and extensive experiments on 11 datasets demonstrate SWAP’s effectiveness, harmlessness, and robustness against adaptive attacks. The approach enables reliable, black-box verification of prompt ownership with strong practical impact for IP protection in prompt-tuning ecosystems.

Abstract

Large-scale vision-language models, especially CLIP, have demonstrated remarkable performance across diverse downstream tasks. Soft prompts, as carefully crafted modules that efficiently adapt vision-language models to specific tasks, necessitate effective copyright protection. In this paper, we investigate model copyright protection by auditing whether suspicious third-party models incorporate protected soft prompts. While this can be viewed as a special case of model ownership auditing, our analysis shows that existing techniques are ineffective due to prompt learning's unique characteristics. Non-intrusive auditing is inherently prone to false positives when independent models share similar data distributions with victim models. Intrusive approaches also fail: backdoor methods designed for CLIP cannot embed functional triggers, while extending traditional DNN backdoor techniques to prompt learning suffers from harmfulness and ambiguity challenges. We find that these failures in intrusive auditing stem from the same fundamental reason: watermarking operates within the same decision space as the primary task yet pursues opposing objectives. Motivated by these findings, we propose sequential watermarking for soft prompts (SWAP), which implants watermarks into a different and more complex space. SWAP encodes watermarks through a specific order of defender-specified out-of-distribution classes, inspired by the zero-shot prediction capability of CLIP. This watermark, which is embedded in a more complex space, keeps the original prediction label unchanged, making it less opposed to the primary task. We further design a hypothesis-test-guided verification protocol for SWAP and provide theoretical analyses of success conditions. Extensive experiments on 11 datasets demonstrate SWAP's effectiveness, harmlessness, and robustness against potential adaptive attacks.

Paper Structure

This paper contains 27 sections, 4 theorems, 20 equations, 9 figures, 14 tables, 1 algorithm.

Key Result

Proposition 1

Let $\pi(p)$ be the sequence extracted from the suspicious model and $\pi_{o}(\mathcal{T})$ be the defender-specified sequence. Given the null hypothesis $H_0: d(\pi(p), \pi_{o}(\mathcal{T})) = \tau$ and the alternative hypothesis $H_1: d(\pi(p), \pi_{o}(\mathcal{T})) < \tau$, where $\tau$ is a th
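The one-sided test in Proposition 1 can be illustrated with a minimal sketch. The excerpt does not show the paper's distance $d$ or its test statistic, so the code below rests on assumptions: $d$ is taken to be a normalized Kendall tau distance between orderings, and the p-value comes from a normal approximation over per-sample distances. The function names `kendall_tau_distance` and `one_sided_p_value` are hypothetical.

```python
import math
from itertools import combinations
from statistics import NormalDist, mean, stdev


def kendall_tau_distance(seq, ref):
    """Fraction of class pairs whose relative order differs between seq and ref.

    Assumed stand-in for d(pi(p), pi_o(T)): 0.0 means identical order,
    1.0 means fully reversed.
    """
    pos = {c: i for i, c in enumerate(seq)}
    pairs = list(combinations(ref, 2))
    discordant = sum(1 for a, b in pairs if pos[a] > pos[b])
    return discordant / len(pairs)


def one_sided_p_value(distances, tau):
    """Approximate p-value for H1: mean distance < tau.

    Assumption: a normal approximation replaces whatever statistic the
    paper actually uses.
    """
    n = len(distances)
    s = stdev(distances)
    if s == 0:
        # Degenerate case: all distances identical.
        return 0.0 if mean(distances) < tau else 1.0
    z = (mean(distances) - tau) / (s / math.sqrt(n))
    return NormalDist().cdf(z)


# Defender-specified order of four verification classes (illustrative).
reference = [0, 1, 2, 3]
# Orders extracted from a suspicious model on five verification samples.
extracted = [[0, 1, 2, 3], [1, 0, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3], [1, 0, 2, 3]]
distances = [kendall_tau_distance(seq, reference) for seq in extracted]
p_value = one_sided_p_value(distances, tau=0.5)
```

Under these assumptions, a small `p_value` rejects $H_0$ and supports the claim that the suspicious model reproduces the registered ordering. The threshold `tau=0.5` and the sample orders are illustrative values, not figures from the paper.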

Figures (9)

  • Figure 1: The comparison between the backdoor-based watermarking scheme (i.e., BWAP) and our proposed SWAP for CLIP soft prompts. BWAP determines ownership through induced misclassification, which inevitably alters the model’s predictions. In contrast, SWAP verifies ownership by examining the sequential ordering of additional defender-specified classes rather than changing predictions. This design preserves the model’s utility while enabling reliable ownership verification.
  • Figure 2: Three scenarios (one benign and two malicious) involved in prompt ownership verification. In prompt watermarking, the prompt developer generates a prompt along with its watermark and registers both with a trusted third-party verifier. When the prompt is maliciously reused, the verifier can determine its ownership by comparing the embedded watermark. In watermark removal attacks, a malicious reuser attempts to utilize the prompt while removing or obfuscating the watermark to evade verification, thereby enabling unauthorized use. In false claim attacks, a malicious developer seeks to pre-register a transferable watermark to falsely claim ownership of independently developed soft prompts for CLIP models.
  • Figure 3: (a) Benign accuracy on CoOp and CoCoOp across different datasets, showing the effectiveness of prompt tuning; (b) Attack success rate on CLIP (reference model), CoOp and CoCoOp (victim models), where adversarial examples generated by CLIP achieve a high success rate on CLIP, CoOp and CoCoOp. The high transferability of attacks indicates that BWAP is susceptible to false claim attacks.
  • Figure 4: The overall pipeline of SWAP. In the prompt watermarking stage, we introduce $n$ additional verification classes and embed a distinctive watermark by enforcing a predefined sequential ordering among them during prompt tuning. Specifically, the objective $\mathcal{L}_{o}$ applies a hinge-like constraint that maintains a fixed margin $\varepsilon$ between the logits of consecutive verification classes, while $\mathcal{L}_{f}$ preserves the original task performance through a standard cross-entropy loss. In the ownership verification stage, we incorporate the verification classes into the testing classes and examine the predetermined sequential pattern in the output probabilities to verify ownership.
  • Figure 5: Resistance to the fine-tuning attack (left) and the model-pruning attack (right).
  • ...and 4 more figures
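
The two objectives named in the Figure 4 caption can be sketched in plain Python. This is an assumption-laden illustration, not the paper's implementation: the exact form of $\mathcal{L}_{o}$, the weighting `lam` between the two losses, and the function names `ordering_loss`, `cross_entropy`, `total_loss`, and `extract_order` are all hypothetical. The hinge term here penalizes any consecutive pair of verification-class logits whose gap falls below the margin $\varepsilon$.

```python
import math


def ordering_loss(verif_logits, margin):
    """Hinge-like penalty on consecutive verification-class logits.

    verif_logits is listed in the defender-specified order; each pair
    should satisfy z_i - z_{i+1} >= margin (assumed form of L_o).
    """
    return sum(
        max(0.0, margin - (a - b))
        for a, b in zip(verif_logits, verif_logits[1:])
    )


def cross_entropy(task_logits, label):
    """Standard cross-entropy for one sample (plays the role of L_f)."""
    m = max(task_logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in task_logits))
    return log_sum - task_logits[label]


def total_loss(task_logits, label, verif_logits, margin=0.5, lam=1.0):
    """Assumed combination: L_f + lam * L_o."""
    return cross_entropy(task_logits, label) + lam * ordering_loss(verif_logits, margin)


def extract_order(verif_probs):
    """Verification-class indices sorted by descending probability."""
    return sorted(range(len(verif_probs)), key=lambda i: -verif_probs[i])
```

At verification time, `extract_order` applied to the probabilities of the verification classes yields the sequence that would be compared against the registered one. The defaults `margin=0.5` and `lam=1.0` are placeholders, not values reported in the paper.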

Theorems & Definitions (5)

  • Definition 1
  • Proposition 1
  • Theorem 1
  • Theorem 1
  • Proposition 2