Neural network fragile watermarking with no model performance degradation

Zhaoxia Yin, Heng Yin, Xinpeng Zhang

TL;DR

This work addresses the challenge of verifying the integrity of pre-trained neural networks against malicious fine-tuning and backdoor attacks without degrading their performance. It introduces a black-box fragile watermarking framework that jointly trains a generative model with a secret key to produce fragile triggers, enabling remote verification via an accuracy-based metric. A variance-based regularization term in the loss encourages diverse, sensitive triggers, and the method demonstrates no loss in base model accuracy while reliably detecting tampering across various ResNet architectures and datasets. The approach offers practical security for model providers and cloud services by enabling tamper detection without requiring access to model internals.
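
The summary above names a variance-based regularization term but does not give its formula. The following is a minimal sketch of one possible reading, not the authors' loss: it assumes the generator is trained with a cross-entropy term that pulls each trigger toward its secret-key label, plus a penalty on the variance of the classifier's softmax output, so that triggers sit near a decision boundary where small parameter changes flip the label. The names `softmax`, `output_variance`, `trigger_loss`, and the weight `lam` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_variance(probs):
    """Variance of a probability vector; 0 when all classes are equally likely."""
    mean = sum(probs) / len(probs)
    return sum((p - mean) ** 2 for p in probs) / len(probs)


def trigger_loss(logits, key_label, lam=1.0):
    """Hypothetical per-trigger loss (an assumption, not the paper's formula).

    Cross-entropy toward the secret-key label, plus lam times the variance
    of the softmax output. A flatter output gives a smaller variance term.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[key_label])
    return cross_entropy + lam * output_variance(probs)
```

Under this assumed form, a confidently classified trigger has a low cross-entropy term but a high variance term, while a trigger near the boundary has the reverse, and `lam` trades the two off.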

Abstract

Deep neural networks are vulnerable to malicious fine-tuning attacks such as data poisoning and backdoor attacks. Recent research has therefore proposed methods for detecting malicious fine-tuning of neural network models. However, these methods usually degrade the performance of the protected model. We therefore propose a novel neural network fragile watermarking method that causes no model performance degradation. During watermarking, we train a generative model with a specific loss function and a secret key to generate triggers that are sensitive to fine-tuning of the target classifier. During verification, we use the watermarked classifier to obtain the label of each fragile trigger. Malicious fine-tuning can then be detected by comparing the secret key with these labels. Experiments on classic datasets and classifiers show that the proposed method effectively detects malicious fine-tuning with no model performance degradation.
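
The verification step described in the abstract can be sketched as a black-box check. This is an illustrative sketch under stated assumptions: the classifier is any callable that returns a label, the secret key is a list holding one expected label per trigger, and the model is flagged as soon as trigger accuracy falls below a threshold. The function names `acc_tri` and `is_tampered` and the default threshold of 1.0 are hypothetical. The metric name follows the $AccTri$ notation used in the figure captions.

```python
from typing import Callable, Sequence


def acc_tri(classify: Callable, triggers: Sequence, secret_key: Sequence[int]) -> float:
    """Fraction of fragile triggers whose predicted label matches the secret key."""
    if len(triggers) != len(secret_key):
        raise ValueError("need exactly one key label per trigger")
    if not triggers:
        raise ValueError("trigger set is empty")
    matches = sum(1 for t, k in zip(triggers, secret_key) if classify(t) == k)
    return matches / len(triggers)


def is_tampered(classify: Callable, triggers: Sequence,
                secret_key: Sequence[int], threshold: float = 1.0) -> bool:
    """Flag the model when trigger accuracy drops below the (assumed) threshold."""
    return acc_tri(classify, triggers, secret_key) < threshold


# Toy stand-ins for a classifier before and after fine-tuning: the decision
# boundary moves slightly, which flips the labels of triggers placed near it.
def original_model(x: float) -> int:
    return int(x >= 0.50)


def fine_tuned_model(x: float) -> int:
    return int(x >= 0.52)


triggers = [0.49, 0.51, 0.50, 0.48]  # inputs close to the decision boundary
secret_key = [0, 1, 1, 0]            # labels the untouched model assigns
```

Only the predicted labels are needed, so the check can be run remotely without access to model internals, which matches the black-box setting described in the TL;DR.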

Paper Structure

This paper contains 12 sections, 2 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: The framework of generating fragile trigger set.
  • Figure 2: The $AccTri$ of generated samples from $G_{non}$, $G_{cla}$, $G_{full}$, and $G_{var}$ in each test epoch. Experimental verifications are carried out in (a) where all model parameters can be modified and (b) where only the last model layer is modified.