CBD: A Certified Backdoor Detector Based on Local Dominant Probability
Zhen Xiang, Zidi Xiong, Bo Li
TL;DR
This work introduces CBD, the first certified backdoor detector. CBD uses an adjustable conformal prediction scheme built on a novel statistic, local dominant probability (LDP), to detect backdoored neural networks without access to training data. Using a small set of benign shadow models and a calibration set, CBD computes a p-value for the classifier under inspection and uses it to decide whether to raise an alarm. The authors derive a certification condition under which attacks are guaranteed to be detectable, together with a probabilistic upper bound on the false positive rate (FPR). The theory shows that attacks whose triggers are more robust to test-time noise and have smaller perturbation magnitudes are more likely to be detected with guarantees. Experiments on four vision datasets with multiple backdoor types show high detection and certification performance with low FPR. The method offers a practical, theoretically grounded approach to post-training backdoor defense, with potential to complement certified robustness and guide future attacks toward detectable regimes.
Abstract
Backdoor attacks are a common threat to deep neural networks. During testing, samples embedded with a backdoor trigger will be misclassified to an adversarial target class by a backdoored model, while samples without the backdoor trigger will be correctly classified. In this paper, we present the first certified backdoor detector (CBD), which is built on a novel, adjustable conformal prediction scheme using our proposed statistic, local dominant probability. For any classifier under inspection, CBD provides 1) a detection inference, 2) the condition under which attacks are guaranteed to be detectable for the same classification domain, and 3) a probabilistic upper bound on the false positive rate. Our theoretical results show that attacks with triggers that are more resilient to test-time noise and have smaller perturbation magnitudes are more likely to be detected with guarantees. Moreover, we conduct extensive experiments on four benchmark datasets considering various backdoor types, such as BadNet, CB, and Blend. CBD achieves comparable or even higher detection accuracy than state-of-the-art detectors, and in addition provides detection certification. Notably, for backdoor attacks with random perturbation triggers bounded by $\ell_2 \leq 0.75$ that achieve more than 90\% attack success rate, CBD achieves 100\% (98\%), 100\% (84\%), 98\% (98\%), and 72\% (40\%) empirical (certified) detection true positive rates on the four benchmark datasets GTSRB, SVHN, CIFAR-10, and TinyImageNet, respectively, with low false positive rates.
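
To make the conformal prediction step concrete, the following is a minimal sketch of a standard (non-adjustable) conformal p-value, not the adjustable scheme or the local dominant probability statistic of CBD, neither of which is specified in this abstract. It assumes that one scalar statistic has already been computed for each benign shadow model and for the classifier under inspection, and that larger values look more backdoor-like. The function names, the significance level `alpha`, and all numeric values are hypothetical.

```python
from typing import Sequence


def conformal_p_value(calibration_stats: Sequence[float], test_stat: float) -> float:
    """Standard conformal p-value of test_stat against benign calibration statistics.

    Assumption: larger statistics are more backdoor-like, so the p-value is the
    (smoothed) fraction of calibration statistics at least as large as test_stat.
    """
    n = len(calibration_stats)
    if n == 0:
        raise ValueError("need at least one calibration statistic")
    # Count benign statistics that are at least as extreme as the test statistic.
    at_least_as_extreme = sum(1 for s in calibration_stats if s >= test_stat)
    # The +1 terms treat the test statistic as one more exchangeable sample.
    return (1 + at_least_as_extreme) / (n + 1)


def raise_alarm(calibration_stats: Sequence[float], test_stat: float,
                alpha: float = 0.05) -> bool:
    """Flag the inspected classifier when its p-value is at most alpha."""
    return conformal_p_value(calibration_stats, test_stat) <= alpha


if __name__ == "__main__":
    # Hypothetical statistics from 20 benign shadow models (illustrative only).
    calibration = [i / 100 for i in range(10, 30)]
    for stat in (0.155, 0.90):  # hypothetical statistics of inspected models
        p = conformal_p_value(calibration, stat)
        print(f"stat={stat:.3f}  p={p:.4f}  alarm={raise_alarm(calibration, stat)}")
```

In this standard setting, if the statistic of a benign classifier is exchangeable with the calibration statistics, the probability that its p-value is at most `alpha` is at most `alpha`, which is the usual route to a false positive rate bound. With 20 calibration statistics the smallest attainable p-value is 1/21, so an alarm at `alpha = 0.05` requires the inspected statistic to exceed every calibration statistic. The bound that CBD actually provides is derived in the paper and may take a different form.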
