Neural network fragile watermarking with no model performance degradation
Zhaoxia Yin, Heng Yin, Xinpeng Zhang
TL;DR
This work addresses the challenge of verifying the integrity of pre-trained neural networks against malicious fine-tuning and backdoor attacks without degrading their performance. It introduces a black-box fragile watermarking framework that jointly trains a generative model with a secret key to produce fragile triggers, enabling remote verification via an accuracy-based metric. A variance-based regularization term in the loss encourages diverse, sensitive triggers, and the method demonstrates no loss in base model accuracy while reliably detecting tampering across various ResNet architectures and datasets. The approach offers practical security for model providers and cloud services by enabling tamper detection without requiring access to model internals.
Abstract
Deep neural networks are vulnerable to malicious fine-tuning attacks such as data poisoning and backdoor attacks. Recent research has therefore proposed methods for detecting malicious fine-tuning of neural network models. However, these methods usually degrade the performance of the protected model. We therefore propose a novel fragile watermarking method for neural networks that causes no model performance degradation. During watermarking, we train a generative model with a specific loss function and a secret key to generate triggers that are sensitive to fine-tuning of the target classifier. During verification, we use the watermarked classifier to obtain the label of each fragile trigger. Malicious fine-tuning can then be detected by comparing the secret key with these labels. Experiments on classic datasets and classifiers show that the proposed method effectively detects malicious fine-tuning without degrading model performance.
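
The verification step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `trigger_accuracy` and `is_tampered`, the representation of the secret key as a list of expected labels, and the default threshold of 1.0 are all assumptions made for illustration. The classifier is treated as a black box that maps a trigger to a label.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")  # type of a trigger (e.g. an image tensor); left generic here


def trigger_accuracy(
    classify: Callable[[T], int],
    triggers: Sequence[T],
    secret_key: Sequence[int],
) -> float:
    """Fraction of triggers whose predicted label matches the secret key.

    Assumption: the secret key is a sequence of expected labels,
    one per fragile trigger.
    """
    if len(triggers) != len(secret_key):
        raise ValueError("triggers and secret_key must have the same length")
    if len(triggers) == 0:
        raise ValueError("at least one trigger is required")
    matches = sum(
        1 for trigger, expected in zip(triggers, secret_key)
        if classify(trigger) == expected
    )
    return matches / len(triggers)


def is_tampered(
    classify: Callable[[T], int],
    triggers: Sequence[T],
    secret_key: Sequence[int],
    threshold: float = 1.0,  # hypothetical: any mismatch counts as tampering
) -> bool:
    """Flag the model as modified when trigger accuracy falls below threshold."""
    return trigger_accuracy(classify, triggers, secret_key) < threshold


if __name__ == "__main__":
    # Toy stand-ins: integers play the role of triggers, and a lambda
    # plays the role of the black-box classifier.
    triggers = [3, 14, 25, 36]
    secret_key = [3, 4, 5, 6]
    original = lambda x: x % 10
    fine_tuned = lambda x: 0 if x == 25 else x % 10  # one label has shifted

    print(is_tampered(original, triggers, secret_key))
    print(is_tampered(fine_tuned, triggers, secret_key))
```

Because only predicted labels are needed, a check of this kind can run against a remotely hosted model without access to its weights, which is what makes the scheme black-box.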
