Jailbreaking the Non-Transferable Barrier via Test-Time Data Disguising

Yongli Xiang, Ziming Hong, Lina Yao, Dadong Wang, Tongliang Liu

TL;DR

JailNTL exposes a vulnerability of non-transferable learning (NTL) in the black-box setting: test-time data disguising can jailbreak the non-transferable barrier. The method combines data-intrinsic disguising (DID) and model-guided disguising (MGD), using a GAN-based disguising network, a bidirectional CycleGAN structure, and zero-order gradient estimation, so the NTL model weights are never modified. Empirically, JailNTL raises unauthorized-domain accuracy by up to 55.7% using only 1% authorized data, outperforming white-box baselines, and it strengthens white-box attacks when integrated with TransNTL. The work underscores the need to secure NTL deployments in black-box settings and offers a flexible framework that can augment existing attack strategies while informing defense design.

Abstract

Non-transferable learning (NTL) has been proposed to protect model intellectual property (IP) by creating a "non-transferable barrier" that restricts generalization from authorized to unauthorized domains. Recently, a well-designed attack, which restores unauthorized-domain performance by fine-tuning NTL models on a few authorized samples, has highlighted the security risks of NTL-based applications. However, such an attack requires modifying model weights and is therefore invalid in the black-box scenario. This raises a critical question: can we trust the security of NTL models deployed as black-box systems? In this work, we reveal the first loophole of black-box NTL models by proposing a novel attack method (dubbed JailNTL) that jailbreaks the non-transferable barrier through test-time data disguising. The main idea of JailNTL is to disguise unauthorized data so that the NTL model identifies it as authorized, thereby bypassing the non-transferable barrier without modifying the NTL model weights. Specifically, JailNTL encourages unauthorized-domain disguising at two levels: (i) data-intrinsic disguising (DID), which eliminates domain discrepancy and preserves class-related content at the input level, and (ii) model-guided disguising (MGD), which mitigates output-level statistical differences of the NTL model. Empirically, when attacking state-of-the-art (SOTA) NTL models in the black-box scenario, JailNTL achieves an accuracy increase of up to 55.7% in the unauthorized domain using only 1% authorized samples, largely exceeding existing SOTA white-box attacks.
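
In the black-box setting only the NTL model's outputs are observable, so a loss computed on those outputs cannot be backpropagated through the model. The TL;DR mentions zero-order gradient estimation for this purpose. The following is a minimal sketch of one common zero-order estimator (two-point finite differences along random Gaussian directions); this summary does not specify the paper's exact estimator, so the function names, the toy loss, and the hyperparameters `mu` and `num_dirs` are all assumptions made for illustration.

```python
import random


def zero_order_grad(loss_fn, params, mu=1e-3, num_dirs=64, rng=None):
    """Estimate the gradient of loss_fn at params using only loss evaluations.

    Each random direction u contributes a central finite difference
    (loss(params + mu*u) - loss(params - mu*u)) / (2*mu), scaled by u.
    No access to the internals of loss_fn is needed.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    grad = [0.0] * len(params)
    for _ in range(num_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in params]
        plus = [p + mu * ui for p, ui in zip(params, u)]
        minus = [p - mu * ui for p, ui in zip(params, u)]
        slope = (loss_fn(plus) - loss_fn(minus)) / (2.0 * mu)
        for i, ui in enumerate(u):
            grad[i] += slope * ui / num_dirs
    return grad


def black_box_loss(params):
    # Hypothetical stand-in for "query the black-box model and score its outputs".
    # A simple quadratic is used so the true gradient (2 * params) is known.
    return sum(p * p for p in params)


estimate = zero_order_grad(black_box_loss, [1.0, -2.0, 3.0], num_dirs=20000)
true_grad = [2.0, -4.0, 6.0]
```

With enough random directions the estimate approaches the true gradient; in practice the number of directions trades query cost against estimator variance.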

Paper Structure

This paper contains 50 sections, 16 equations, 13 figures, 6 tables, 3 algorithms.

Figures (13)

  • Figure 1: Comparison of NTL model and attack paradigms. (a) The pre-trained NTL model contains a "non-transferable barrier" to restrict authorized-to-unauthorized generalization. (b) Existing white-box attacks break the non-transferable barrier by modifying the NTL model weights. (c) To enable a feasible black-box attack, our JailNTL aims to disguise unauthorized data so it can be identified as authorized by the NTL model, thereby bypassing the non-transferable barrier without modifying the NTL model weights.
  • Figure 2: JailNTL architecture with (a) data-intrinsic disguising and (b) model-guided disguising. In the diagram, red circles represent the unauthorized domain, green circles the authorized domain, and blue circles the disguised domain. Light red and light green signify domains that have undergone feedback processing. $f_d$ represents the disguising model, which maps the unauthorized domain to its disguised version, while $\hat{f}_d$ performs the inverse mapping. $f_c$ and $\hat{f}_c$ are discriminators for the authorized and unauthorized domains, respectively. Different processes are represented by different colors and line styles, as illustrated in the top-left legend.
  • Figure 3: Statistical differences of NTL models between the authorized domain (CIFAR-10) and the unauthorized domain (STL-10): (a) prediction confidence, (b) prediction proportions.
  • Figure 4: Attack Phase in JailNTL: Unauthorized domain test data $\mathcal{D}_u$ is input into the disguise model $f_d$, producing disguised domain data $f_d(\mathcal{D}_u)$, which is then fed into the NTL model $f_{ntl}$ to obtain the final prediction $\hat{y}$.
  • Figure 5: Visualization of JailNTL's effect on model attention using GradCAM.
  • ...and 8 more figures
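
The attack phase described in the Figure 4 caption is a simple composition: unauthorized data is passed through the disguising model $f_d$, and the result is sent to the NTL model $f_{ntl}$ as an ordinary query. The sketch below illustrates only that data flow. The toy NTL model, the fixed-offset "domain shift", and the hand-written disguise function are hypothetical stand-ins; in the paper $f_d$ is a learned GAN-based network, not a fixed subtraction.

```python
from typing import Callable, List, Sequence

Sample = List[float]

DOMAIN_SHIFT = 10.0  # hypothetical offset separating the two toy domains


def attack_phase(disguise: Callable[[Sample], Sample],
                 ntl_model: Callable[[Sample], int],
                 unauthorized_batch: Sequence[Sample]) -> List[int]:
    """Return predictions f_ntl(f_d(x)); the model is only queried, never modified."""
    return [ntl_model(disguise(x)) for x in unauthorized_batch]


def toy_ntl_model(x: Sample) -> int:
    # Toy barrier: inputs that look unauthorized (large mean) collapse to class 0.
    if sum(x) / len(x) > DOMAIN_SHIFT / 2:
        return 0
    # Authorized-looking inputs get a normal prediction (index of the largest entry).
    return max(range(len(x)), key=lambda i: x[i])


def toy_disguise(x: Sample) -> Sample:
    # Stand-in for the learned f_d: remove the domain offset, keep class content.
    return [v - DOMAIN_SHIFT for v in x]


unauthorized = [[10.2, 11.0], [11.5, 10.1]]  # underlying classes: 1 and 0

direct = attack_phase(lambda x: x, toy_ntl_model, unauthorized)
disguised = attack_phase(toy_disguise, toy_ntl_model, unauthorized)
```

In this toy setup the direct queries all collapse to a single class, while the disguised queries recover the underlying classes, which mirrors the intended effect of bypassing the barrier at the input side.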