CEPA: Consensus Embedded Perturbation for Agnostic Detection and Inversion of Backdoors
Guangmingmei Yang, Xi Li, Hang Wang, David J. Miller, George Kesidis
TL;DR
CEPA tackles backdoor poisoning in neural networks with a post-training detector and backdoor inverter that is agnostic to the embedding mechanism. It jointly optimizes perturbations in input space and a common perturbation in embedded feature space to reveal a consensus backdoor pattern across candidate target pairs, then applies MAD-based statistics to three metrics to detect backdoors and infer their target classes. When a backdoor is detected, CEPA inverts the pattern by constraining the perturbations with positive weights $\lambda_1$ and $\lambda_2$, yielding interpretable trigger estimates while maintaining classification performance on clean data. Empirically, CEPA demonstrates strong detection accuracy and reliable backdoor pattern reconstruction on CIFAR-10 and CIFAR-100 under multiple attack types, with performance competitive with or superior to seven baselines and robustness to adaptive attacks, all while requiring only a small clean dataset and minimal hyperparameter tuning.
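To make the MAD-based detection step concrete, the following is a minimal sketch of a generic median-absolute-deviation outlier test applied to one per-class statistic. It is an illustration under stated assumptions, not CEPA's actual detection rule: the function names, the example values, the 1.4826 consistency constant, and the 3.5 threshold are all assumed here and are not taken from the paper.

```python
from statistics import median

# Assumption: the standard constant that makes the MAD a consistent
# estimate of the standard deviation under a Gaussian null.
MAD_SCALE = 1.4826


def mad_anomaly_scores(values):
    """Return |x - median| / (1.4826 * MAD) for each value."""
    values = [float(v) for v in values]
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scale = MAD_SCALE * mad
    if scale == 0.0:
        # Degenerate case: at least half the values coincide with the median,
        # so no scale is defined; report no anomalies.
        return [0.0 for _ in values]
    return [abs(v - med) / scale for v in values]


def flag_outliers(values, threshold=3.5):
    """Return (indices whose score exceeds the threshold, all scores).

    The threshold is a hypothetical choice for this sketch.
    """
    scores = mad_anomaly_scores(values)
    flagged = [i for i, s in enumerate(scores) if s > threshold]
    return flagged, scores


# Hypothetical per-class statistic for a 10-class model (made-up numbers);
# the last class stands out from the rest.
stats = [1.02, 0.97, 1.05, 0.99, 1.01, 0.98, 1.03, 1.00, 0.96, 2.40]
flagged, scores = flag_outliers(stats)
print(flagged)
```

In this sketch, a class whose statistic lies far from the median of all classes, measured in robust MAD units, becomes the candidate backdoor target. Using the median and MAD rather than the mean and standard deviation keeps the null estimate from being distorted by the very outlier being sought.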
Abstract
A variety of defenses have been proposed against Trojans planted in (backdoor attacks on) deep neural network (DNN) classifiers. Backdoor-agnostic methods seek to reliably detect and/or mitigate backdoors irrespective of the incorporation mechanism used by the attacker, while inversion methods explicitly assume one. In this paper, we describe a new detector that: relies on embedded feature representations to estimate (invert) the backdoor and to identify its target class; can operate without access to the training dataset; and is highly effective for various incorporation mechanisms (i.e., is backdoor agnostic). Our detection approach is evaluated, and found to compare favorably, against an array of published defenses for a variety of different attacks on the CIFAR-10 and CIFAR-100 image-classification domains.
