Neural Dehydration: Effective Erasure of Black-box Watermarks from DNNs with Limited Data

Yifan Lu, Wenxuan Li, Mi Zhang, Xudong Pan, Min Yang

TL;DR

This work examines the robustness of black-box watermarks for deep neural networks by introducing Dehydra, a watermark-agnostic removal attack that operates with limited data. It recovers watermark data from the protected model’s internals via aggressive model inversion and then unlearns the recovered samples during finetuning, with target-class detection and recovered-sample splitting added to preserve model utility. Dehydra erases all ten mainstream black-box watermark schemes evaluated on three benchmark datasets and DNN architectures, preserving at least 90% of the stolen model’s utility with less than 2% of the training data, and achieves data-free removal for the five fixed-class watermarks. The findings suggest that data-efficient attacks exploiting model internals can neutralize current watermark designs, and that watermark robustness should be evaluated against stronger attacks.

Abstract

To protect the intellectual property of well-trained deep neural networks (DNNs), black-box watermarks, which are embedded into the prediction behavior of DNN models on a set of specially crafted samples and extracted from suspect models using only API access, have gained increasing popularity in both academia and industry. Watermarks are usually designed to be robust against attackers who steal the protected model and obfuscate its parameters to remove the watermark. However, current robustness evaluations are primarily performed under moderate attacks or unrealistic settings. Existing removal attacks can only crack a small subset of the mainstream black-box watermarks, and fall short in four key aspects: incomplete removal, reliance on prior knowledge of the watermark, performance degradation, and high dependency on data. In this paper, we propose a watermark-agnostic removal attack called Neural Dehydration (abbrev. Dehydra), which effectively erases all ten mainstream black-box watermarks from DNNs, with only limited or even no data dependence. In general, our attack pipeline exploits the internals of the protected model to recover and unlearn the watermark message. We further design target class detection and recovered sample splitting algorithms to reduce the utility loss and achieve data-free watermark removal on five of the watermarking schemes. We conduct a comprehensive evaluation of Dehydra against ten mainstream black-box watermarks on three benchmark datasets and DNN architectures. Compared with existing removal attacks, Dehydra achieves strong removal effectiveness across all the covered watermarks, preserving at least $90\%$ of the stolen model utility under data-limited settings, i.e., with less than $2\%$ of the training data or even data-free.
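
The black-box verification described in the abstract can be illustrated with a minimal sketch: the owner queries a suspect model on the secret watermark samples and checks how often it returns the owner-chosen labels. The function names, the toy models, and the `threshold` value of 0.9 are hypothetical choices made for illustration; they are not taken from the paper.

```python
from typing import Callable, Sequence


def watermark_accuracy(
    predict: Callable[[Sequence[float]], int],
    samples: Sequence[Sequence[float]],
    target_labels: Sequence[int],
) -> float:
    """Fraction of watermark samples on which the suspect model returns
    the owner-chosen label. `predict` stands in for an API query."""
    if not samples or len(samples) != len(target_labels):
        raise ValueError("samples and target_labels must be non-empty and aligned")
    hits = sum(1 for x, y in zip(samples, target_labels) if predict(x) == y)
    return hits / len(samples)


def verify_ownership(
    predict: Callable[[Sequence[float]], int],
    samples: Sequence[Sequence[float]],
    target_labels: Sequence[int],
    threshold: float = 0.9,  # hypothetical decision threshold
) -> bool:
    """Claim ownership when watermark accuracy reaches the threshold."""
    return watermark_accuracy(predict, samples, target_labels) >= threshold


# Toy stand-ins for suspect models (assumptions, not real watermark schemes):
# the "watermarked" model maps any input whose last feature is 1.0 to class 6.
def watermarked_model(x: Sequence[float]) -> int:
    return 6 if x[-1] == 1.0 else 0


def clean_model(x: Sequence[float]) -> int:
    return 0


triggers = [[0.1, 0.2, 1.0], [0.5, 0.3, 1.0], [0.9, 0.4, 1.0]]
labels = [6, 6, 6]

print(verify_ownership(watermarked_model, triggers, labels))
print(verify_ownership(clean_model, triggers, labels))
```

A removal attack succeeds, in these terms, when the surrogate model's watermark accuracy falls below the verification threshold while its accuracy on normal data stays close to that of the stolen model.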

Paper Structure (44 sections, 4 theorems, 17 equations, 8 figures, 18 tables, 1 algorithm)

Key Result

Theorem D.1

Consider the optimal linear model $f^*(x) = \text{sign} (\langle \boldsymbol{w^*}, \boldsymbol{x} \rangle +b^*)$ obtained by minimizing the misclassification risk on $\mathcal{D}_{nor} \cup p\mathcal{D}_{wm}$. If the watermark data satisfies (1) $\frac{p}{0.5+p}(\sigma_{wm}^2-\sigma^2) + \frac{0.5p}

Figures (8)

  • Figure 1: An illustration of our threat model. The model owner trains a watermarked model $f_w$ with a secret set of watermark samples, and hopes to perform ownership verification via API queries for suspect models. On the other hand, the attacker has managed to acquire the white-box $f_w$, and aims to derive a surrogate model $f_a$ with the watermark removed, evading subsequent ownership verification.
  • Figure 2: An overview of Dehydra. The upper region shows the basic attack, which comprises two stages: watermark recovering, which reconstructs batches of samples $\{B_c\}_{c=0}^{C-1}$ close to the real watermark data for each class, and watermark unlearning, which unlearns the recovered samples during finetuning, along with the auxiliary dataset $X_{aux}$. The lower region shows the improved designs: target class detection and recovered samples splitting. The former detects the target classes after recovering (e.g., $C_6$ in this case) and retains only the batches $\{B_c\}_{\ast}$ from those classes. The latter performs class-wise splitting on each $B_c$ before unlearning, to identify samples closer to watermark or normal data, i.e., $\{B_{c,wmk}\}_\ast$ and $\{B_{c,nor}\}_\ast$, respectively.
  • Figure 3: An illustration of the watermark smoothness discrepancy hypothesis. Left: A clean model with its decision boundary. Right: A fixed-class watermark is embedded, with all watermark samples labeled as Class A.
  • Figure 4: Pilot study results of the improved designs. (a) Class-wise smoothness analysis of the five fixed-class and five non-fixed-class watermarks on CIFAR-10. (b) Activation visualization of a model protected by the Content watermark with target class 6. (c) Distribution of sample contributions per class, on the salient neurons of the recovered samples $B_6$.
  • Figure 5: SmoothAcc gap distribution under ten random tests. The box plot shows min/max and quartiles, as well as the estimated outliers.
  • ...and 3 more figures
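
The watermark recovering stage in Figure 2 reconstructs, for each class $c$, samples that the white-box model assigns to that class. A minimal sketch of class-wise model inversion on a toy linear softmax classifier is given below. It is a generic illustration under stated assumptions, not the paper's procedure: the inversion objective used by Dehydra and any regularizers are not given in this excerpt, and `invert_class`, the weights `W` and `b`, the step count, the learning rate, and the $[-1, 1]$ input range are all hypothetical.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits(W: List[List[float]], b: List[float], x: List[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def predict(W: List[List[float]], b: List[float], x: List[float]) -> int:
    z = logits(W, b, x)
    return max(range(len(z)), key=z.__getitem__)


def invert_class(W, b, target: int, dim: int, steps: int = 200, lr: float = 0.5):
    """Gradient descent on the INPUT x (model weights stay fixed) to minimize
    the cross-entropy between the model output and the class `target`."""
    x = [0.0] * dim
    for _ in range(steps):
        p = softmax(logits(W, b, x))
        # d CE / d logits = p - onehot(target)
        g = [pk - (1.0 if k == target else 0.0) for k, pk in enumerate(p)]
        # chain rule through the linear layer: d CE / d x_j = sum_k g_k * W[k][j]
        grad_x = [sum(g[k] * W[k][j] for k in range(len(W))) for j in range(dim)]
        # descent step, then clamp to the assumed valid input range [-1, 1]
        x = [max(-1.0, min(1.0, xj - lr * gj)) for xj, gj in zip(x, grad_x)]
    return x


# Toy white-box model with 3 classes and 2 input features (an assumption).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]

# One recovered sample per class, loosely analogous to the batches B_c.
recovered = {c: invert_class(W, b, c, dim=2) for c in range(3)}
for c, x in recovered.items():
    print(c, [round(v, 3) for v in x], predict(W, b, x))
```

In the pipeline of Figure 2, samples recovered this way would then be unlearned during finetuning; that stage is not sketched here because its loss is not specified in this excerpt.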

Theorems & Definitions (4)

  • Theorem D.1
  • Lemma D.2
  • Lemma D.3
  • Theorem D.4