On the Weaknesses of Backdoor-based Model Watermarking: An Information-theoretic Perspective

Aoting Hu, Yanzhi Chen, Renjie Xie, Adrian Weller

TL;DR

This work analyzes why backdoor-based model watermarking is vulnerable to watermark erasure from an information-theoretic standpoint, showing that using out-of-distribution trigger sets creates exploitable gaps between trigger and normal data distributions. It introduces In-distribution Watermark Embedding (IWE), which couples the main task and watermark task by designing trigger sets from in-distribution data and by reusing redundant logits as watermark carriers, thereby entangling the two objectives. The authors formalize the verification as a hypothesis test on watermark accuracy and provide a security analysis against trigger-set forgery and logit-based attacks, complemented by empirical results on CIFAR-10/100 and Caltech-101 that demonstrate negligible main-task degradation and robust watermark protection under both black-box and white-box adversaries. The method also reveals a link between watermarking and membership inference, offering a principled path to embedding ownership signals directly in model outputs while preserving utility. Overall, IWE delivers a strong, scalable defense against watermark erasure with minimal impact on performance, and points to future extensions for broader data modalities and large-language-model contexts.

Abstract

Safeguarding the intellectual property of machine learning models has emerged as a pressing concern in AI security. Model watermarking is a powerful technique for protecting ownership of machine learning models, yet its reliability has recently been challenged by watermark removal attacks. In this work, we investigate why existing watermark embedding techniques, particularly those based on backdooring, are vulnerable. Through an information-theoretic analysis, we show that the resilience of watermarking against erasure attacks hinges on the choice of trigger-set samples, and that the out-of-distribution trigger sets in current use are inherently vulnerable to white-box adversaries. Based on this discovery, we propose a novel model watermarking scheme, In-distribution Watermark Embedding (IWE), to overcome the limitations of existing methods. To further minimise the gap to clean models, we analyze the role of logits as watermark information carriers and propose a new approach to better conceal watermark information within the logits. Experiments on real-world datasets including CIFAR-100 and Caltech-101 demonstrate that our method robustly defends against various adversaries with negligible accuracy loss (< 0.1%).

Paper Structure

This paper contains 22 sections, 1 theorem, 22 equations, 7 figures, 6 tables, 2 algorithms.

Key Result

Theorem 1

Consider random variables $X, X_W \in \mathbb{R}^D$ and $Y, Y_W \in \mathbb{R}$. As $KL[p(X)\|p(X_W)] \to \infty$, there exist infinitely many functions $g^*$ such that $g^*= \arg\max_{g} I(g(X); Y)$ and $I(g^*(X_W); Y_W) = 0$.
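
One way to build intuition for Theorem 1 is the limiting case in which the two input distributions have disjoint supports, so that the KL divergence is infinite. The construction below is an illustrative sketch under that assumption, not the paper's proof; the set $A$, the maximizer $g_0$, and the constant $c$ are notation introduced here. Let $A = \operatorname{supp} p(X)$, suppose $p(X_W)$ places no mass on $A$, and let $g_0$ be any maximizer of $I(g(X); Y)$. For a constant $c$ in the output space of $g_0$, define

$$
g_c(x) =
\begin{cases}
g_0(x), & x \in A, \\
c, & x \notin A.
\end{cases}
$$

Because $g_c(X) = g_0(X)$ almost surely, $I(g_c(X); Y) = I(g_0(X); Y)$, so $g_c$ is also a maximizer. Because $g_c(X_W) = c$ almost surely, $g_c(X_W)$ is constant and $I(g_c(X_W); Y_W) = 0$. Assuming the output space contains infinitely many values, each choice of $c$ gives a different function, which matches the "infinitely many $g^*$" in the statement: a model can be optimal for the main task while carrying no information about the watermark labels.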

Figures (7)

  • Figure 1: Examples of trigger sets for watermark embedding. (a), (b), and (c) are the trigger sets used in this work, all based on in-distribution images. Specifically, (b) rotates the image in (a) and (c) changes the color of the image in (a). (d) shows a trigger set that uses out-of-distribution images, following adi2018turning.
  • Figure 2: Heatmap of inner activations in deep neural networks for in-distribution and out-of-distribution trigger sets. Darker colors indicate larger activations.
  • Figure 3: An overview of the proposed IWE method. When computing the logits of the watermark task (WT), i.e., the purple double circles in the figure, we map the redundant logits (e.g., the non-top-2 logits of the main task, represented by the grey double circles) to WT logits. This mapping process is controlled by a partition key only known to the model owner. During ownership verification, only the logits, the partition key, the trigger set and the computation graph of the model will be exposed.
  • Figure 4: The distribution of watermark accuracy (i.e., $\mathrm{ACC_W}$) for a clean model and an IWE watermarked model. The curve 'Clean' represents the case of a clean model (i.e., no watermark) and the other curves represent the cases in which an IWE watermarked model is under attack. A large separation between the 'Clean' curve and the other curves indicates a strong defense. The dotted vertical line represents the threshold $t$ in Algorithm \ref{alg:ver}, which corresponds to a significance level of 5% (so that FPR = $5\%$).
  • Figure 5: Main task accuracy and watermark accuracy as a function of the percentage of pruned neurons in the proposed IWE method. (a) and (b) show the cases for the fine-pruning (FP) attack liu2018fine, whereas (c) and (d) show the cases for the adversarial neuron pruning (ANP) attack wu2021adversarial. As more neurons are pruned, the main task accuracy and the watermark accuracy drop simultaneously, suggesting that the neurons used by the two tasks largely overlap.
  • ...and 2 more figures
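
The caption of Figure 4 ties the verification threshold $t$ to a 5% significance level. The following is a minimal sketch of how such a threshold could be derived. It assumes, and this is not stated in the source, that a clean model's number of correct trigger-set predictions follows a Binomial distribution with a chance-level accuracy `p_null`; the function names `binomial_tail`, `threshold_count`, and `verify_ownership` are hypothetical.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P[K >= k] for K ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def threshold_count(n_triggers: int, p_null: float, alpha: float = 0.05) -> int:
    """Smallest count k such that a clean model reaches k correct
    trigger predictions with probability at most alpha.

    Assumption: clean-model correctness on each trigger is an independent
    Bernoulli(p_null) draw. Returns n_triggers + 1 if no count qualifies.
    """
    for k in range(n_triggers + 1):
        if binomial_tail(n_triggers, k, p_null) <= alpha:
            return k
    return n_triggers + 1


def verify_ownership(n_correct: int, n_triggers: int, p_null: float,
                     alpha: float = 0.05) -> bool:
    """Reject the 'clean model' hypothesis when the observed number of
    correct trigger predictions reaches the threshold count."""
    return n_correct >= threshold_count(n_triggers, p_null, alpha)


if __name__ == "__main__":
    n, p0 = 100, 0.5  # hypothetical trigger-set size and chance accuracy
    k = threshold_count(n, p0)
    print(f"threshold t = {k / n:.2f}")
    print("suspect model flagged:", verify_ownership(95, n, p0))
```

Dividing the returned count by the trigger-set size gives the accuracy threshold $t$. In practice the null distribution could instead be estimated empirically from clean models, as the 'Clean' curve in Figure 4 suggests.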

Theorems & Definitions (4)

  • Theorem 1: Limitations of out-of-distribution schemes
  • proof
  • Definition 1: Redundant logits
  • Definition 2: Partition key
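
The caption of Figure 3 describes mapping redundant logits (for example, the non-top-2 logits of the main task) to watermark-task logits under the control of a secret partition key. The sketch below illustrates one way this could look. It is an assumption-laden toy, not the paper's definition: the partition key is modeled as a secret assignment of each main-task class index to a watermark class, and the redundant logits in each group are averaged. The names `make_partition_key` and `watermark_logits`, the seed-based key generation, and the mean aggregation are all hypothetical.

```python
import random
from typing import List


def make_partition_key(num_classes: int, num_wm_classes: int, seed: int) -> List[int]:
    """Assign every main-task class index to a watermark class.

    Assumption: the key is a secret random assignment derived from a seed.
    """
    rng = random.Random(seed)
    return [rng.randrange(num_wm_classes) for _ in range(num_classes)]


def watermark_logits(logits: List[float], partition_key: List[int],
                     num_wm_classes: int, top_k: int = 2) -> List[float]:
    """Map redundant main-task logits to watermark-task logits.

    The top_k largest logits are treated as non-redundant and skipped;
    the remaining logits are averaged within each key-defined group
    (mean aggregation is an illustrative choice).
    """
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    non_redundant = set(order[:top_k])
    sums = [0.0] * num_wm_classes
    counts = [0] * num_wm_classes
    for i, z in enumerate(logits):
        if i in non_redundant:
            continue
        group = partition_key[i]
        sums[group] += z
        counts[group] += 1
    # Groups that received no redundant logit default to 0.0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


if __name__ == "__main__":
    logits = [5.0, 4.0, 1.0, 3.0, -1.0, 2.0]  # toy main-task logits
    key = [0, 1, 0, 1, 0, 1]                  # toy partition key
    print(watermark_logits(logits, key, num_wm_classes=2))
```

Because the top logits are left untouched, the main-task prediction is unchanged in this toy setup, while the watermark readout depends on a key that only the owner holds.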