BlockDoor: Blocking Backdoor Based Watermarks in Deep Neural Networks
Yi Hao Puah, Anh Tu Ngo, Nandish Chattopadhyay, Anupam Chattopadhyay
TL;DR
The paper exposes a vulnerability in neural network watermarking schemes that rely on backdoors by introducing BlockDoor, a wrapper-based framework that detects and neutralizes Trigger samples from the adversarial, out-of-distribution (OOD), and random-label families without significantly harming clean-task performance. It implements three specialized wrappers: an adversarial-detection wrapper with a two-step verification, an OOD-detection wrapper, and a random-label wrapper that uses a partially trained model for feature extraction followed by an SVM (with Hungarian assignment) for label verification. Empirical results across multiple architectures (e.g., ResNet18, ViT, MobileNet, VGG16) and datasets show substantial reductions in watermark verification accuracy, up to about 98% in some settings, while test accuracy typically stays within a few percentage points of the original model. The work highlights a critical weakness of backdoor-based watermarking: a practical, scalable wrapper can block watermark-based ownership verification, which has implications for IP protection and model theft prevention in real-world deployments.
Abstract
Adoption of machine learning models across industries has turned Deep Neural Networks (DNNs) into prized Intellectual Property (IP), which needs to be protected from being stolen or used without authorization. This need gave rise to multiple watermarking schemes, through which one can establish the ownership of a model. Watermarking using backdooring is the most well-established method available in the literature, with specific works demonstrating the difficulty of removing the watermarks, which are embedded as backdoors within the weights of the network. However, in our work, we have identified a critical flaw in the design of watermark verification with backdoors, pertaining to the behaviour of the samples of the Trigger Set, which acts as the secret key. In this paper, we present BlockDoor, a comprehensive package of techniques used as a wrapper to block all three kinds of Trigger samples that are used in the literature to embed watermarks within trained neural networks as backdoors. The framework implemented through BlockDoor is able to detect potential Trigger samples through separate functions for adversarial-noise-based triggers, out-of-distribution triggers and random-label-based triggers. Apart from a simple Denial-of-Service for a potential Trigger sample, our approach is also able to modify the Trigger samples for correct machine learning functionality. Extensive evaluation of BlockDoor establishes that it is able to reduce the watermark validation accuracy on the Trigger set by up to $98\%$ without compromising functionality, with a drop of less than $1\%$ in accuracy on clean samples in the best case. BlockDoor has been tested on multiple datasets and neural architectures.
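The wrapper idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `BlockingWrapper`, `toy_model`, `toy_detector`, and `toy_sanitizer` are hypothetical, the inputs are toy tuples rather than images, and a single hand-written detector stands in for the three detection functions (adversarial, out-of-distribution, random-label). The sketch only shows the control flow: a flagged sample is either refused (Denial-of-Service) or modified before it reaches the model, and the effect is measured as accuracy on the Trigger set versus accuracy on clean samples.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Sample = Tuple[float, ...]            # toy stand-in for an input tensor
Model = Callable[[Sample], int]       # maps a sample to a class label
Detector = Callable[[Sample], bool]   # True if the sample looks like a trigger
Sanitizer = Callable[[Sample], Sample]


@dataclass
class BlockingWrapper:
    """Hypothetical wrapper: screens inputs before they reach the model."""
    model: Model
    detectors: List[Detector]
    sanitizer: Optional[Sanitizer] = None  # None means deny service

    def predict(self, x: Sample) -> Optional[int]:
        if any(detect(x) for detect in self.detectors):
            if self.sanitizer is None:
                return None  # Denial-of-Service for a suspected trigger
            return self.model(self.sanitizer(x))  # classify the modified sample
        return self.model(x)  # clean samples pass through unchanged


def accuracy(predict: Callable[[Sample], Optional[int]],
             samples: Sequence[Sample], labels: Sequence[int]) -> float:
    """Fraction of samples whose prediction equals the expected label."""
    hits = sum(1 for x, y in zip(samples, labels) if predict(x) == y)
    return hits / len(samples)


# Toy backdoored model (assumption): a large last feature acts as the
# trigger and forces the watermark label 7; otherwise the label is the
# sign of the first feature.
def toy_model(x: Sample) -> int:
    if x[-1] > 0.9:
        return 7
    return int(x[0] > 0.0)


def toy_detector(x: Sample) -> bool:
    return x[-1] > 0.9  # flags the toy trigger pattern


def toy_sanitizer(x: Sample) -> Sample:
    return x[:-1] + (0.0,)  # removes the toy trigger pattern


clean, clean_labels = [(1.0, 0.0), (-1.0, 0.0)], [1, 0]
triggers, trigger_labels = [(1.0, 1.0), (-1.0, 1.0)], [7, 7]

wrapped = BlockingWrapper(toy_model, [toy_detector], toy_sanitizer)
print("trigger accuracy, unwrapped:", accuracy(toy_model, triggers, trigger_labels))
print("trigger accuracy, wrapped:  ", accuracy(wrapped.predict, triggers, trigger_labels))
print("clean accuracy, wrapped:    ", accuracy(wrapped.predict, clean, clean_labels))
```

In this toy setting the wrapper leaves clean predictions untouched while the Trigger samples no longer produce the watermark label, which is the behaviour the watermark validation accuracy is meant to capture. Real detectors would be learned models and would not separate the two sets this cleanly.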
