On the Weaknesses of Backdoor-based Model Watermarking: An Information-theoretic Perspective
Aoting Hu, Yanzhi Chen, Renjie Xie, Adrian Weller
TL;DR
This work uses an information-theoretic analysis to explain why backdoor-based model watermarking is vulnerable to watermark erasure: out-of-distribution trigger sets create an exploitable gap between the trigger and normal data distributions. It introduces In-distribution Watermark Embedding (IWE), which entangles the main task and the watermark task by drawing trigger sets from in-distribution data and by reusing redundant logits as watermark carriers. The authors formalize verification as a hypothesis test on watermark accuracy and provide a security analysis against trigger-set forgery and logit-based attacks. Empirical results on CIFAR-10/100 and Caltech-101 show negligible main-task degradation and robust watermark protection under both black-box and white-box adversaries. The method also reveals a link between watermarking and membership inference, offering a principled way to embed ownership signals directly in model outputs while preserving utility. Overall, IWE offers a strong defense against watermark erasure with minimal impact on performance, and points to future extensions to other data modalities and large language models.
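To make the idea of verification as a hypothesis test on watermark accuracy concrete, here is a minimal sketch. It is not the authors' procedure: it assumes a one-sided exact binomial test in which the null hypothesis says a non-watermarked model matches the owner's trigger labels only at some chance rate. The names `binomial_tail`, `verify_watermark`, `chance`, and `alpha` are hypothetical and chosen for illustration.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Exact P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def verify_watermark(correct: int, total: int, chance: float, alpha: float = 0.01):
    """One-sided test of H0: watermark accuracy equals `chance`.

    `correct` is the number of trigger samples on which the suspect model
    returns the owner's watermark label, out of `total` queried samples.
    `chance` is an assumed accuracy for a model that carries no watermark.
    Returns (ownership_claimed, p_value).
    """
    p_value = binomial_tail(correct, total, chance)
    return p_value < alpha, p_value


# A suspect model matching 90 of 100 trigger labels, against an assumed 10% chance rate.
claimed, p_value = verify_watermark(correct=90, total=100, chance=0.1)
print(claimed, p_value)
```

Under these assumptions, a model that matches the trigger labels far more often than chance yields a very small p-value and the ownership claim is accepted, while a model near chance level does not reject the null hypothesis.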
Abstract
Safeguarding the intellectual property of machine learning models has emerged as a pressing concern in AI security. Model watermarking is a powerful technique for protecting ownership of machine learning models, yet its reliability has been challenged by recent watermark removal attacks. In this work, we investigate why existing watermark embedding techniques, particularly those based on backdooring, are vulnerable. Through an information-theoretic analysis, we show that the resilience of watermarking against erasure attacks hinges on the choice of trigger-set samples, and that the current use of out-of-distribution trigger sets is inherently vulnerable to white-box adversaries. Based on this discovery, we propose a novel model watermarking scheme, In-distribution Watermark Embedding (IWE), to overcome the limitations of existing methods. To further minimize the gap to clean models, we analyze the role of logits as watermark information carriers and propose a new approach to better conceal watermark information within the logits. Experiments on real-world datasets including CIFAR-100 and Caltech-101 demonstrate that our method robustly defends against various adversaries with negligible accuracy loss (< 0.1%).
