Explanation as a Watermark: Towards Harmless and Multi-bit Model Ownership Verification via Watermarking Feature Attribution

Shuo Shao, Yiming Li, Hongwei Yao, Yiling He, Zhan Qin, Kui Ren

TL;DR

Explanation as a Watermark (EaaW) is a new watermarking paradigm that implants verification behaviors into the explanation of feature attribution instead of model predictions. It embeds a 'multi-bit' watermark into the feature attribution explanation of specific trigger samples without changing the original prediction.

Abstract

Ownership verification is currently the most critical and widely adopted post-hoc method to safeguard model copyright. In general, model owners exploit it to identify whether a given suspicious third-party model is stolen from them by examining whether it has particular properties 'inherited' from their released models. Currently, backdoor-based model watermarks are the primary and cutting-edge methods to implant such properties in the released models. However, backdoor-based methods have two fatal drawbacks, including harmfulness and ambiguity. The former indicates that they introduce maliciously controllable misclassification behaviors (i.e., backdoor) to the watermarked released models. The latter denotes that malicious users can easily pass the verification by finding other misclassified samples, leading to ownership ambiguity. In this paper, we argue that both limitations stem from the 'zero-bit' nature of existing watermarking schemes, where they exploit the status (i.e., misclassified) of predictions for verification. Motivated by this understanding, we design a new watermarking paradigm, i.e., Explanation as a Watermark (EaaW), that implants verification behaviors into the explanation of feature attribution instead of model predictions. Specifically, EaaW embeds a 'multi-bit' watermark into the feature attribution explanation of specific trigger samples without changing the original prediction. We correspondingly design the watermark embedding and extraction algorithms inspired by explainable artificial intelligence. In particular, our approach can be used for different tasks (e.g., image classification and text generation). Extensive experiments verify the effectiveness and harmlessness of our EaaW and its resistance to potential attacks.

Paper Structure (52 sections, 2 theorems, 14 equations, 15 figures, 14 tables, 2 algorithms)

This paper contains 52 sections, 2 theorems, 14 equations, 15 figures, 14 tables, 2 algorithms.

Key Result

Proposition 1

Let $\tilde{\bm{\mathcal{W}}}$ be the watermark extracted from the suspicious model, and let $\bm{\mathcal{W}}$ be the original watermark. Given the null hypothesis $H_0$: $\tilde{\bm{\mathcal{W}}}$ is independent of $\bm{\mathcal{W}}$, and the alternative hypothesis $H_1$: $\tilde{\bm{\mathcal{W}}}$ has an ...

Figures (15)

  • Figure 1: The main pipeline of our EaaW and backdoor-based methods. Backdoor-based methods depend on the misclassification to determine the ownership. Instead of changing the predictions, our EaaW implants the watermark into the explanation of feature attribution for verification.
  • Figure 2: The main pipeline of the watermark extraction algorithm based on feature attribution. First, we locally sample some masked samples by randomly masking a few basic parts of the trigger sample. Second, we input the masked dataset to get the prediction and calculate the metric vector. Finally, we fit a linear model to evaluate the importance of each basic part in the trigger sample. The sign of the explanation serves as the watermark.
  • Figure 3: The trigger samples (on the upper row) used to watermark image classification models and the corresponding extracted watermark (on the bottom row). The target watermark is shown on the left.
  • Figure 4: Watermark success rate (WSR), the log p-value, and functionality evaluation (test accuracy or PPL) of watermarked ResNet-18 and GPT-2 against fine-tuning attack.
  • Figure 5: Watermark success rate (WSR), the log p-value, and functionality evaluation (test accuracy or PPL) of watermarked ResNet-18 and GPT-2 against model-pruning attack.
  • ...and 10 more figures
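
The three extraction steps described in the Figure 2 caption (sample masked copies of the trigger, collect a metric vector from the model, fit a linear model and read off the signs) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `extract_watermark` and `predict_fn`, the choice to mask a basic part by zeroing it, the flat 1-D trigger, the plain least-squares fit, and the +1/-1 bit encoding are all assumptions made for illustration; the toy "model" at the bottom is a hypothetical linear scorer, not a trained network.

```python
import numpy as np

def extract_watermark(predict_fn, trigger, num_parts, num_masks=256, seed=0):
    """Illustrative sketch: return one sign (+1 or -1) per basic part of the trigger."""
    rng = np.random.default_rng(seed)
    # Step 1: sample binary masks; 1 keeps a basic part, 0 masks it (assumed: zeroing).
    masks = rng.integers(0, 2, size=(num_masks, num_parts))
    # Map every element of the flat trigger to the index of its basic part.
    part_ids = np.arange(trigger.size) * num_parts // trigger.size
    # Step 2: query the model on each masked sample to build the metric vector.
    metrics = np.array([predict_fn(trigger * m[part_ids]) for m in masks])
    # Step 3: fit a linear model (ordinary least squares with an intercept column).
    design = np.column_stack([masks, np.ones(num_masks)])
    weights, *_ = np.linalg.lstsq(design, metrics, rcond=None)
    # The sign of each part's weight is read out as one watermark bit.
    return np.where(weights[:num_parts] >= 0, 1, -1)

# Hypothetical toy setup: a linear scorer whose per-part weights carry the target signs.
target = np.array([1, -1, 1, 1, -1, -1, 1, -1])
toy_weights = np.repeat(target, 2).astype(float)  # two elements per basic part
trigger = np.ones(16)
extracted = extract_watermark(lambda x: float(x @ toy_weights), trigger, num_parts=8)
match_rate = float(np.mean(extracted == target))  # fraction of matching bits
```

The fraction of matching bits computed on the last line is one simple way to compare the extracted and target watermarks. Whether it coincides with the paper's watermark success rate (WSR) is not stated in this summary, so treat it as an assumed stand-in.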

Theorems & Definitions (4)

  • Proposition 1
  • Definition 1: Ambiguity Attack
  • Proposition 2
  • Proof