Elevating Defenses: Bridging Adversarial Training and Watermarking for Model Resilience

Janvi Thakkar; Giulio Zizzo; Sergio Maffeis

TL;DR

The paper tackles the conflicting interaction between adversarial training and watermarking by introducing adversarial watermarks generated with a higher perturbation budget $\beta$ (where $\beta > \epsilon + \alpha$) to co-exist with adversarial training. This design yields a robust watermarked model that maintains strong evasion resistance and reliable ownership verification, demonstrated on MNIST and Fashion-MNIST under black-box, grey-box, and white-box model stealing and removal attacks. The authors show that adversarial watermarks preserve robustness better than OOD watermarks, with high watermark transferability and resilience to pruning and fine-tuning, thereby offering practical ownership protection without sacrificing defensive performance. The work advances defense-in-depth by harmonizing two previously conflicting protections, enabling practical deployment for secure model ownership in real-world scenarios.

Abstract

Machine learning models are being used in an increasing number of critical applications; thus, securing their integrity and ownership is essential. Recent studies observed that adversarial training and watermarking interact in conflicting ways. This work introduces a novel framework that integrates adversarial training with watermarking to fortify models against evasion attacks and to provide confident model verification in case of intellectual property theft. We use adversarial training together with adversarial watermarks to train a robust watermarked model. The key intuition is to generate adversarial watermarks with a higher perturbation budget than the one used for adversarial training, thus avoiding the conflict. We evaluate the proposed technique on the MNIST and Fashion-MNIST datasets against various model stealing attacks. The results consistently outperform the existing baseline in terms of robustness, and further demonstrate the resilience of this defense against pruning and fine-tuning removal attacks.
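The budget separation described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a toy linear softmax classifier in NumPy as a stand-in for the real model, a generic L-infinity PGD loop, and illustrative values for the budgets. Reading $\epsilon$ as the adversarial-training budget and $\alpha$ as an additional margin is an assumption here; the function name `pgd_linf` and all numeric values are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def pgd_linf(x, y, W, b, budget, step, iters):
    """Untargeted L-infinity PGD against a toy linear softmax classifier.

    Hypothetical helper for illustration; the real model would be a neural net.
    """
    x_adv = x.copy()
    for _ in range(iters):
        p = softmax(W @ x_adv + b)
        p[y] -= 1.0                                     # d(cross-entropy)/d(logits)
        grad_x = W.T @ p                                # chain rule through the linear layer
        x_adv = x_adv + step * np.sign(grad_x)          # ascend the loss
        x_adv = np.clip(x_adv, x - budget, x + budget)  # project onto the budget ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                # stay in the valid pixel range
    return x_adv

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))           # toy weights: 3 classes, 16 "pixels"
b = np.zeros(3)
x = rng.uniform(0.3, 0.7, size=16)     # a toy input, away from the pixel bounds
y = 0

# Illustrative budgets (assumed values), chosen so that beta > eps + alpha.
eps, alpha, beta = 0.10, 0.05, 0.25
assert beta > eps + alpha

x_train_adv = pgd_linf(x, y, W, b, budget=eps, step=eps / 4, iters=10)
x_watermark = pgd_linf(x, y, W, b, budget=beta, step=beta / 4, iters=10)

dist_train = np.max(np.abs(x_train_adv - x))   # bounded by eps
dist_wm = np.max(np.abs(x_watermark - x))      # bounded by beta
print(dist_train, dist_wm)
```

Both sample types come from the same attack loop and differ only in the budget, which is what lets the watermark set occupy a region that adversarial training at the smaller budget does not cover.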

Paper Structure (23 sections, 6 figures, 6 tables, 3 algorithms)

Introduction
Related Work
Approach
Baseline
Proposed Technique
Experimental Setting
Dataset, Model and Metric
Model Stealing Techniques
Removal Attack: Pruning and Fine-tuning
Results and Analysis
Performance
Robustness in Black-box Setting
With respect to Pruning
With respect to Fine-tuning
Robustness in Grey-box Setting
...and 8 more sections

Figures (6)

Figure 1: Distribution of adversarial samples and adversarial watermarks generated with different perturbation budgets.
Figure 2: Impact of removal attacks on the model stolen in the black-box setting
Figure 3: Impact of removal attacks on the model stolen in the grey-box setting
Figure 4: Impact of removal attacks on the model stolen in the white-box setting
Figure 5: Distribution of the datasets for MNIST, including the training set, the adversarial training set, and the watermarking set (OOD for the baseline, adversarial watermarks for the proposed technique). The OOD dataset used for the baseline is Fashion-MNIST.
...and 1 more figures
