CBD: A Certified Backdoor Detector Based on Local Dominant Probability

Zhen Xiang, Zidi Xiong, Bo Li

TL;DR

This work introduces CBD, the first certified backdoor detector, which uses an adjustable conformal-prediction framework built on a novel local dominant probability (LDP) statistic to detect backdoored neural networks without access to training data. By leveraging a small set of benign shadow models and a calibration set, CBD quantifies a p-value that triggers alarms while providing a theoretical guarantee that attacks with stronger trigger robustness and smaller perturbations are detectable. The authors derive a backdoor-detection certification inequality and a probabilistic upper bound on false positives, and validate CBD across four vision datasets with multiple backdoor types, achieving high detection and certification performance with low FPR. The method offers a practical, theoretically grounded approach to post-training backdoor defense, with potential to complement certified robustness and guide future attacks toward detectable regimes.

Abstract

Backdoor attacks are a common threat to deep neural networks. During testing, samples embedded with a backdoor trigger will be misclassified to an adversarial target class by a backdoored model, while samples without the backdoor trigger will be correctly classified. In this paper, we present the first certified backdoor detector (CBD), which is built on a novel, adjustable conformal prediction scheme using our proposed statistic, the local dominant probability. For any classifier under inspection, CBD provides 1) a detection inference, 2) the condition under which the attacks are guaranteed to be detectable for the same classification domain, and 3) a probabilistic upper bound for the false positive rate. Our theoretical results show that attacks with triggers that are more resilient to test-time noise and have smaller perturbation magnitudes are more likely to be detected with guarantees. Moreover, we conduct extensive experiments on four benchmark datasets considering various backdoor types, such as BadNet, CB, and Blend. CBD achieves comparable or even higher detection accuracy than state-of-the-art detectors, and it additionally provides detection certification. Notably, for backdoor attacks with random perturbation triggers bounded by $\ell_2\leq0.75$ that achieve more than 90% attack success rate, CBD achieves 100% (98%), 100% (84%), 98% (98%), and 72% (40%) empirical (certified) detection true positive rates on the four benchmark datasets GTSRB, SVHN, CIFAR-10, and TinyImageNet, respectively, with low false positive rates.
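
The detection inference described above compares a statistic from the inspected classifier against statistics from benign shadow models. A minimal sketch of that comparison follows, using the textbook conformal p-value rather than the paper's adjustable scheme, which is not reproduced here. The function names `conformal_p_value` and `raises_alarm`, the significance level `alpha`, and the example numbers are all illustrative assumptions.

```python
from typing import Sequence


def conformal_p_value(calibration_stats: Sequence[float], test_stat: float) -> float:
    """Textbook conformal p-value (an assumption, not the paper's adjustable variant).

    Counts how many calibration statistics (e.g. from benign shadow models)
    are at least as large as the statistic of the inspected classifier, with
    the usual +1 correction in numerator and denominator.
    """
    n = len(calibration_stats)
    n_at_least = sum(1 for s in calibration_stats if s >= test_stat)
    return (n_at_least + 1) / (n + 1)


def raises_alarm(calibration_stats: Sequence[float], test_stat: float,
                 alpha: float = 0.05) -> bool:
    """Flag the inspected classifier when its p-value is at most alpha."""
    return conformal_p_value(calibration_stats, test_stat) <= alpha


# Hypothetical calibration statistics from 19 benign shadow models.
calibration = [0.25 + 0.001 * i for i in range(19)]

# A statistic far above every calibration value gives p = (0 + 1) / (19 + 1).
p_suspicious = conformal_p_value(calibration, 0.9)

# A statistic below every calibration value gives p = (19 + 1) / (19 + 1).
p_benign_like = conformal_p_value(calibration, 0.0)
```

With n calibration statistics the smallest attainable p-value is 1/(n+1), so the number of shadow models limits how small `alpha` can usefully be in this simplified version.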

Paper Structure

This paper contains 32 sections, 5 theorems, 22 equations, 11 figures, 3 tables, 2 algorithms.

Key Result

Theorem 4.1

(Backdoor Detection Guarantee) For an arbitrary classifier $f(\cdot; w):{\mathcal{X}}\rightarrow{\mathcal{Y}}$ to be inspected, let $x_1, \cdots, x_K$ be the $K$ randomly selected samples and $\mathcal{N}(0, \sigma^2 I)$ be the isotropic Gaussian distribution used to compute the LDP for $f(\cdot; w)$ ... where $\Phi$ is the standard Gaussian CDF, $\pi=\min_{k=1,\cdots,K} R_{\delta, t}(x_k|w,\sigma)$ is ...

Figures (11)

  • Figure 1: Benign classifier with a small LDP close to $\frac{1}{4}$.
  • Figure 2: Classifier being backdoored with a large LDP.
  • Figure 4: Certification performance of CBD against backdoor attacks with random triggers with perturbation magnitude $\ell_2\leq0.75$ measured by CTPR (solid) for a range of $\sigma$ for $\beta=0, 0.1, 0.2$. The CTPRs are all upper-bounded by the TPRs (dashed), showing the correctness of our certification. Notably, CBD achieves up to 98% (100%), 84% (100%), 98% (98%), and 40% (72%) CTPRs (TPRs) on GTSRB, SVHN, CIFAR-10, and TinyImageNet, respectively, across all choices of $\sigma$ and $\beta$. An increment in $\beta$, the assumed ratio of calibration outliers, may lead to further increments in both CTPR and TPR. The hyperparameter $\sigma$ can be determined using the calibration set in practice.
  • Figure 5: Receiver operating characteristic (ROC) curves of CBDsup aggregated over all three trigger types on GTSRB, SVHN, and CIFAR-10, respectively. CBDsup with our proposed LDP statistic achieves higher overall areas under curves (AUCs) than K-Arm and MNTD across the three datasets.
  • Figure 6: Supportive results: (Left) Choice of $\sigma$ for a range of $\psi$ based on our selection scheme, which matches the $\sigma$ choices in Fig. 4 with high CTPR and TPR. (Middle) The histograms of the LDP statistics for the shadow models and the benign models, with the associated empirical CDFs. LDP for benign models is stochastically dominated by the LDP for shadow models for all datasets. (Right) Vulnerability of WaNet attack on GTSRB and SVHN. Attack success rate (ASR) reduces with a negligible drop in benign accuracy (ACC) when inputs are smoothed by Gaussian noise.
  • ...and 6 more figures
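
Figures 1 and 2 contrast a benign classifier, whose LDP stays near $\frac{1}{4}$, with a backdoored one, whose LDP is large. The sketch below illustrates one way such a statistic could be estimated. It assumes the LDP is the largest entry of the class-probability vector obtained by classifying Gaussian-perturbed copies of $K$ samples and averaging over them; this reading is consistent with the $\frac{1}{4}$ value for a balanced four-class classifier, but the exact definition (Definition 3.3) is not shown on this page. The two toy classifiers, the sample points, and all parameter values are hypothetical.

```python
import numpy as np


def local_dominant_probability(predict, xs, sigma, num_classes,
                               n_noise=256, seed=0):
    """Monte Carlo estimate of an LDP-style statistic (assumed definition).

    predict: maps an array of shape (n, d) to integer labels of shape (n,).
    xs:      array of shape (K, d) holding the K clean samples.
    sigma:   standard deviation of the isotropic Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    mean_vec = np.zeros(num_classes)
    for x in xs:
        # Classify n_noise Gaussian-perturbed copies of this sample.
        noisy = x + sigma * rng.standard_normal((n_noise,) + x.shape)
        labels = predict(noisy)
        # Empirical class distribution around this sample.
        mean_vec += np.bincount(labels, minlength=num_classes) / n_noise
    mean_vec /= len(xs)
    # The dominant (largest) entry of the averaged distribution.
    return float(mean_vec.max())


def benign_predict(batch):
    """Toy 4-class classifier: one class per quadrant of the plane."""
    return (batch[:, 0] < 0).astype(int) + 2 * (batch[:, 1] < 0).astype(int)


def skewed_predict(batch):
    """Toy stand-in for a backdoored model: class 0 absorbs half the plane."""
    return np.where(batch[:, 0] > 0, 0, benign_predict(batch))


# One hypothetical clean sample per quadrant, far from the decision boundaries.
samples = np.array([[3.0, 3.0], [-3.0, 3.0], [3.0, -3.0], [-3.0, -3.0]])

ldp_benign = local_dominant_probability(benign_predict, samples, 0.5, 4)
ldp_skewed = local_dominant_probability(skewed_predict, samples, 0.5, 4)
```

In this toy setup the balanced classifier yields a value close to 0.25, while the skewed one yields a value close to 0.5, because a single class dominates the averaged predictions. The skewed classifier only mimics the class imbalance a backdoor might induce; it does not model a trigger.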

Theorems & Definitions (15)

  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Theorem 4.1
  • Proof (sketch)
  • Theorem 4.2
  • Proof (sketch)
  • Corollary 4.3
  • Proof (sketch)
  • Lemma A.1
  • ...and 5 more