Vision-Language Model IP Protection via Prompt-based Learning

Lianyu Wang, Meng Wang, Huazhu Fu, Daoqiang Zhang

TL;DR

Vision-language models such as CLIP offer strong cross-domain recognition but raise IP protection concerns as well-trained backbones can be misused or transferred to unauthorized domains. The paper proposes IP-CLIP, a lightweight, prompt-based protection framework that keeps the CLIP backbone frozen while learning an IP-Prompt to signal domain authorization through domain-specific tokens and image tokens, complemented by a style-enhancement branch with domain feature banks. Two training regimes are introduced: target-specified IP-CLIP, which enforces separation between authorized and unauthorized domain representations using a suite of losses, and target-free IP-CLIP, which relies on style augmentation to simulate unauthorized data without additional data generation or full fine-tuning. The approach is validated on Office-31, Office-Home, and Mini-DomainNet, showing that IP-CLIP achieves higher domain non-transferability (higher $W_{ua}$) while preserving authorized-domain accuracy, outperforming CNN-based IP protection baselines. Overall, IP-CLIP provides a practical, scalable solution for IP protection in VLMs with potential extensions to broader model architectures and downstream tasks, enabling safer deployment of large-scale VLMs in real-world settings.

Abstract

Vision-language models (VLMs) like CLIP (Contrastive Language-Image Pre-Training) have seen remarkable success in visual recognition, highlighting the increasing need to safeguard the intellectual property (IP) of well-trained models. Effective IP protection extends beyond ensuring authorized usage; it also necessitates restricting model deployment to authorized data domains, particularly when the model is fine-tuned for specific target domains. However, current IP protection methods often rely solely on the visual backbone, which may lack sufficient semantic richness. To bridge this gap, we introduce IP-CLIP, a lightweight IP protection strategy tailored to CLIP, employing a prompt-based learning approach. By leveraging the frozen visual backbone of CLIP, we extract both image style and content information, incorporating them into the learning of the IP prompt. This strategy acts as a robust barrier, effectively preventing the unauthorized transfer of features from authorized domains to unauthorized ones. Additionally, we propose a style-enhancement branch that constructs feature banks for both authorized and unauthorized domains. This branch integrates self-enhanced and cross-domain features, further strengthening IP-CLIP's capability to block features from unauthorized domains. Finally, we present three new metrics designed to better balance the performance degradation of authorized and unauthorized domains. Comprehensive experiments in various scenarios demonstrate the promising potential of IP-CLIP for IP protection tasks for VLMs.

Paper Structure

This paper contains 17 sections, 15 equations, 4 figures, 54 tables.

Figures (4)

  • Figure 1: Illustration of model IP protection with IP-CLIP. Domain and image tokens form the IP-Prompt, which a CLIP-based model audits to verify data origin. This prevents unauthorized transfers and degrades performance in unauthorized domains. Notably, IP-Prompt is a lightweight, plug-and-play module for CLIP-based models.
  • Figure 2: (a) The architecture of IP-CLIP is based on a frozen CLIP backbone, where snowflakes denote frozen layers and sparks represent trainable layers. During training, inputs from both the authorized domain $x_a$ and the unauthorized domain $x_u$ are fed into the frozen CLIP visual encoder in parallel to generate feature vectors $f_v^a$ and $f_v^u$. The IP projector extracts domain tokens and image tokens from the visual encoder, which are then used to construct prompts as inputs to the text encoder. The style enhancement branch takes the frozen feature bank and $f_v^a$ as input, with $s_v$ representing the enhanced visual features. The prediction result is derived by calculating the similarity between the visual feature $s_v$/$f_v$ and the text feature $f_t$. $y$ and $\mathcal{L}$ represent the label and loss function, respectively. (b) The inference process of IP-CLIP. (c) Structure of ${Prompt}_a$ and ${Prompt}_u$. (d) Construction of the feature banks $B_a$ and $B_u$, where $D$ and $F$ represent the input dataset and its corresponding visual feature set, respectively. During training, the feature banks remain frozen. (e) Structure of STAM.
  • Figure 3: Several visualization examples of CLIP and IP-CLIP prediction results. Correct predictions are highlighted in green, while incorrect predictions are shown in red.
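
The Figure 2 caption states that the prediction is derived from the similarity between a visual feature ($s_v$ or $f_v$) and the text features $f_t$. The following is a minimal sketch of that step only, assuming cosine similarity followed by a softmax, as is standard for CLIP-style models. The function names, the toy 3-dimensional features, and the `logit_scale` value of 100 are illustrative assumptions, not details taken from the paper, and the IP projector, prompts, and style-enhancement branch are not modeled here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(visual_feature, text_features, logit_scale=100.0):
    """Return (predicted class index, class probabilities).

    visual_feature: one image embedding (stands in for f_v or s_v).
    text_features:  one embedding per class prompt (stands in for f_t).
    logit_scale:    assumed temperature; hypothetical value for illustration.
    """
    v = l2_normalize(visual_feature)
    logits = []
    for t in text_features:
        t_hat = l2_normalize(t)
        cosine = sum(a * b for a, b in zip(v, t_hat))
        logits.append(logit_scale * cosine)
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy example: three class prompts and an image embedding closest to class 0.
f_v = [0.9, 0.1, 0.0]
f_t = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
label, probs = predict(f_v, f_t)
print(label, [round(p, 4) for p in probs])
```

In this reading, IP protection would change which text features the image is compared against, since the prompts that produce $f_t$ depend on domain and image tokens, while the similarity-based scoring itself stays the same as in plain CLIP.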