Universal Backdoor Attacks Detection via Adaptive Adversarial Probe

Yuhang Wang, Huafeng Shi, Rui Min, Ruijia Wu, Siyuan Liang, Yichao Wu, Ding Liang, Aishan Liu

TL;DR

A2P tackles universal post-training backdoor detection under unseen attack types by using adaptive adversarial probes in a global-to-local framework. It combines attention-guided region generation and box-to-sparsity budget scheduling to activate latent backdoors across varying trigger sizes and transparencies, with MAD-based outlier detection guiding infection calls. Across CIFAR-10, GTSRB, and Tiny-ImageNet, A2P outperforms baselines by large margins and shows robustness to trigger diversity, while also enabling potential backdoor elimination via targeted fine-tuning using probe data. This approach offers a practical means to detect diverse unforeseen backdoors in real-world MLaaS and deployment scenarios.
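The MAD-based outlier step mentioned above can be illustrated with a minimal sketch. It assumes each class label receives one scalar score (for example, the fraction of probed images that flip to that label); the score definition, the `threshold` of 2.0, and the function names are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def mad_anomaly_index(scores, consistency=1.4826):
    """Deviation of each score from the median, in units of scaled MAD."""
    scores = np.asarray(scores, dtype=float)
    median = np.median(scores)
    mad = np.median(np.abs(scores - median))
    # Guard against a zero MAD when most scores are identical.
    scale = consistency * mad if mad > 0 else np.finfo(float).eps
    return np.abs(scores - median) / scale

def flag_outliers(scores, threshold=2.0):
    """Boolean mask of scores whose anomaly index exceeds the (assumed) threshold."""
    return mad_anomaly_index(scores) > threshold

# Hypothetical per-label scores: label 5 attracts far more flipped probes.
scores = [0.10, 0.12, 0.09, 0.11, 0.10, 0.95, 0.10, 0.08, 0.11, 0.10]
flags = flag_outliers(scores)
```

The constant 1.4826 makes the MAD comparable to a standard deviation under a normal distribution, so the threshold reads roughly as "number of standard deviations from the median".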

Abstract

Extensive evidence has demonstrated that deep neural networks (DNNs) are vulnerable to backdoor attacks, which motivates the development of backdoor attack detection. Most detection methods are designed to verify whether a model is infected with presumed types of backdoor attacks, yet in practice the adversary is likely to generate diverse backdoor attacks that defenders have not foreseen, which challenges current detection strategies. In this paper, we focus on this more challenging scenario and propose a universal backdoor attack detection method named Adaptive Adversarial Probe (A2P). Specifically, we posit that the challenge of universal backdoor attack detection lies in the fact that different backdoor attacks often exhibit diverse characteristics in trigger patterns (i.e., sizes and transparencies). Therefore, A2P adopts a global-to-local probing framework, which adversarially probes images with adaptive regions and budgets to fit backdoor triggers of different sizes and transparencies. For the probing region, we propose an attention-guided region generation strategy that generates region proposals with different sizes and locations based on the attention of the target model, since trigger regions often manifest higher model activation. For the attack budget, we introduce box-to-sparsity scheduling, which iteratively increases the perturbation budget from a box constraint to a sparse constraint, so that we can better activate latent backdoors with different transparencies. Extensive experiments on multiple datasets (CIFAR-10, GTSRB, Tiny-ImageNet) demonstrate that our method outperforms state-of-the-art baselines by large margins (+12%).
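To make the idea of an escalating, region-restricted probing budget concrete, here is a minimal sketch under stated assumptions. It treats the box constraint as an L-infinity bound that doubles from 8/255 up to the full pixel range, at which point only the region mask limits the perturbation. The doubling rule, the sign-gradient update, and the toy linear scorer standing in for the target model are all hypothetical; the abstract does not specify the actual schedule or optimizer.

```python
import numpy as np

def project(delta, mask, eps):
    """Clip the perturbation to the L-inf box of radius eps and zero it outside the region."""
    return np.clip(delta, -eps, eps) * mask

def budget_schedule(start=8 / 255, stop=1.0, factor=2.0):
    """Yield growing budgets; the doubling rule is an assumption, not the paper's schedule."""
    eps = start
    while eps < stop:
        yield eps
        eps *= factor
    yield stop  # full pixel range: only the region mask limits the perturbation

def probe(x, mask, predict, target_grad, target, steps=10):
    """Return (eps, delta) for the smallest scheduled budget that flips x to target."""
    for eps in budget_schedule():
        delta = np.zeros_like(x)
        step = 2.5 * eps / steps
        for _ in range(steps):
            grad = target_grad(x + delta, target)
            delta = project(delta + step * np.sign(grad), mask, eps)
            delta = np.clip(x + delta, 0.0, 1.0) - x  # keep pixels in [0, 1]
        if predict(x + delta) == target:
            return eps, delta
    return None, None

# Toy stand-in for a model: a two-class linear scorer on a 4x4 image (hypothetical).
W = np.zeros((2, 4, 4))
W[1, :2, :2] = 1.0  # class 1 responds only to the top-left 2x2 patch
b = np.array([1.0, 0.0])

def predict(x):
    return int(np.argmax((W * x).sum(axis=(1, 2)) + b))

def target_grad(x, target):
    # Gradient of (target logit - other logit) w.r.t. x for the linear scorer.
    return W[target] - W[1 - target]

x = np.full((4, 4), 0.2)
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0
found_eps, delta = probe(x, mask, predict, target_grad, target=1)
```

Starting each budget from a zero perturbation keeps the returned `found_eps` interpretable as the smallest scheduled budget that activates the target label inside the masked region.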

Paper Structure (20 sections, 8 equations, 8 figures, 2 tables)

This paper contains 20 sections, 8 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Previous backdoor detection methods focus on detecting whether a model is infected with a presumed type of backdoor attack. In this paper, we focus on the more challenging scenario, where defenders aim to identify infected models that might be embedded with diverse types of unforeseen backdoor attacks.
  • Figure 2: A2P works in a global-to-local probing manner. In each stage, the attention-guided region generation module first shrinks the probing region based on the gradients of the target model; the box-to-sparsity budget scheduling module then iteratively increases the probing budget on the attack region until it finds an appropriate one; finally, the generated adversarial examples are sent to an outlier detector for infected-model identification.
  • Figure 3: Model attention on inputs with triggers, computed with Grad-CAM (Selvaraju et al., 2017), for three images with patch-based triggers and two images with blend-based triggers. The trigger region draws the most attention (gradients) from infected models.
  • Figure 4: Confusion matrix of model predictions on global adversarial attacks with different budgets. (a) infected model with small budget (8/255); (b) infected model with large budget (32/255); (c) clean model with small budget; (d) clean model with large budget.
  • Figure 5: Detection performance with different trigger patterns on CIFAR-10: (a) trigger sizes and (b) trigger transparencies.
  • ...and 3 more figures
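
The region-shrinking step described in the captions of Figures 2 and 3 can be sketched as follows. This is a minimal illustration that assumes a 2-D saliency map (such as a Grad-CAM heatmap) is already available and picks, for each proposal size, the square window with the largest summed saliency. The window-search rule and the names `best_window` and `region_masks` are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

def best_window(saliency, size):
    """Return (top, left) of the size x size window with the largest saliency sum."""
    h, w = saliency.shape
    # Integral image with a zero border so window sums need four lookups.
    ii = np.zeros((h + 1, w + 1))
    ii[1:, 1:] = saliency.cumsum(axis=0).cumsum(axis=1)
    sums = ii[size:, size:] - ii[:-size, size:] - ii[size:, :-size] + ii[:-size, :-size]
    top, left = np.unravel_index(np.argmax(sums), sums.shape)
    return int(top), int(left)

def region_masks(saliency, sizes):
    """Yield (size, mask) pairs, ordered from global to local proposals."""
    for size in sizes:
        top, left = best_window(saliency, size)
        mask = np.zeros_like(saliency)
        mask[top:top + size, left:left + size] = 1.0
        yield size, mask

# Hypothetical 8x8 attention map with a hot 2x2 patch where a trigger might sit.
saliency = np.zeros((8, 8))
saliency[5:7, 1:3] = 1.0
masks = dict(region_masks(saliency, sizes=[8, 4, 2]))
```

Each mask can then restrict where an adversarial probe is allowed to perturb the image, so smaller proposals concentrate the budget on the most attended pixels.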