CEPA: Consensus Embedded Perturbation for Agnostic Detection and Inversion of Backdoors

Guangmingmei Yang, Xi Li, Hang Wang, David J. Miller, George Kesidis

TL;DR

CEPA addresses backdoor poisoning in neural networks with a post-training detector and backdoor inverter that is agnostic to the mechanism the attacker uses to incorporate the backdoor. It optimizes input-space perturbations together with a common embedded feature perturbation to reveal a consensus backdoor pattern across candidate target pairs, then applies MAD-based (median absolute deviation) statistics to three metrics to detect backdoors and infer their target classes. When a backdoor is detected, CEPA inverts the pattern by constraining the perturbations with positive weights $\lambda_1$ and $\lambda_2$, yielding interpretable trigger estimates while maintaining classification performance on clean data. Empirically, CEPA shows strong detection accuracy and reliable backdoor pattern reconstruction on CIFAR-10 and CIFAR-100 under multiple attack types, performing competitively with or better than seven baselines and remaining robust to adaptive attacks, while requiring only a small clean dataset and minimal hyperparameter tuning.
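To make the MAD-based detection step concrete, the following is a minimal sketch of a median-absolute-deviation anomaly score applied to one per-class statistic. It is not the authors' implementation: the function name `mad_anomaly_scores`, the example values, the two-sided scoring, the consistency constant 1.4826 (the usual scaling under a Gaussian assumption), and the threshold of 3.5 are all assumptions chosen for illustration.

```python
from statistics import median


def mad_anomaly_scores(values, consistency=1.4826):
    """Return one MAD-based anomaly score per input value.

    Each score is |v - median| / (consistency * MAD), where MAD is the
    median of the absolute deviations from the median.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; scores are undefined for this data")
    return [abs(v - med) / (consistency * mad) for v in values]


# Hypothetical per-class statistic (e.g. a perturbation norm per candidate
# target class); the values are made up, with class 3 as the outlier.
per_class_stat = [5.1, 4.8, 5.3, 0.9, 5.0, 4.9, 5.2, 5.1, 4.7, 5.0]

scores = mad_anomaly_scores(per_class_stat)

# Hypothetical detection threshold; the source does not state one.
THRESHOLD = 3.5
flagged = [c for c, s in enumerate(scores) if s > THRESHOLD]
print("flagged classes:", flagged)
```

In this toy example only the outlying class exceeds the threshold, which is the behavior a detector of this kind relies on: a model is declared poisoned when some class's statistic is anomalous relative to the others, and that class is reported as the inferred target.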

Abstract

A variety of defenses have been proposed against Trojans planted in (backdoor attacks on) deep neural network (DNN) classifiers. Backdoor-agnostic methods seek to reliably detect and/or mitigate backdoors irrespective of the incorporation mechanism used by the attacker, while inversion methods explicitly assume one. In this paper, we describe a new detector that: relies on embedded feature representations to estimate (invert) the backdoor and to identify its target class; can operate without access to the training dataset; and is highly effective for various incorporation mechanisms (i.e., is backdoor agnostic). Our detection approach is evaluated, and found to compare favorably, against an array of published defenses for a variety of different attacks on the CIFAR-10 and CIFAR-100 image-classification domains.

Paper Structure (27 sections, 4 equations, 12 figures, 10 tables)

Figures (12)

  • Figure 1: Layer 9, $\|\mu\|$ values by class, for typical poisoned and clean models, with the true target class in red.
  • Figure 2: For the BadNet attack: (a) the ground truth perturbation, and (d) the average estimated perturbation $|\mathcal{D}|^{-1}\sum_{x\in \mathcal{D}}\hat{\delta}_x$. For the chessboard attack: (b) the ground truth perturbation (with contrast boosted for better visualization), and (e) a typical $\delta$. For the blended attack: (c) the ground truth perturbation, and (f) $|\mathcal{D}|^{-1}\sum_{x\in \mathcal{D}}\hat{\delta}_x$.
  • Figure 3: Histogram of MAD anomaly scores for CEPA against BadNet on a typical poisoned VGG-16 model for CIFAR-100 calculated at layer 8 using $\|\mu\|$ (left) and $\frac{\sigma}{\|\mu\|}$ (right).
  • Figure 4: Example backdoor patterns (top) and corresponding poisoned images (bottom) used in our experiments.
  • Figure 5: Layer 9, $\sigma/\|\mu\|$ values by class, for typical poisoned and clean models, with the true target class in red.
  • ...and 7 more figures