Improving the Sensitivity of Backdoor Detectors via Class Subspace Orthogonalization
Guangmingmei Yang, David J. Miller, George Kesidis
TL;DR
The paper addresses post-training backdoor detection, which can fail when non-target classes have strong intrinsic features or when the backdoor is subtle. It introduces Class Subspace Orthogonalization (CSO), a detector-agnostic, data-efficient framework that identifies intrinsic class features with per-class masks and enforces orthogonality to them during detector optimization. Suppressing these features amplifies the backdoor signal. CSO is integrated into four detectors (MMBD, NC, PT-RED, and MLBD) to form MMBD-CSO, NC-CSO, PT-RED-CSO, and MLBD-CSO, and is paired with a novel mixed dirty/clean-label X-to-X attack to stress-test defenses. Experiments on CIFAR-10, GTSRB, and TinyImageNet show that CSO substantially improves detection accuracy and robustness to adaptive attacks, with modest time overhead, and that it remains effective against various stealthy backdoor strategies, which supports its practical use for safeguarding deployed models.
Abstract
Most post-training backdoor detection methods rely on attacked models exhibiting extreme outlier detection statistics for the target class of an attack, compared to non-target classes. However, these approaches may fail: (1) when some (non-target) classes are easily discriminable from all others, in which case they may naturally achieve extreme detection statistics (e.g., decision confidence); and (2) when the backdoor is subtle, i.e., when its features are weak relative to intrinsic class-discriminative features. A key observation is that the backdoor target class has contributions to its detection statistic from both the backdoor trigger and its intrinsic features, whereas non-target classes have contributions only from their intrinsic features. To achieve more sensitive detectors, we thus propose to suppress intrinsic features while optimizing the detection statistic for a given class. For non-target classes, such suppression will drastically reduce the achievable statistic, whereas for the target class the (significant) contribution from the backdoor trigger remains. In practice, we formulate a constrained optimization problem that leverages a small set of clean examples from a given class and optimizes the detection statistic while orthogonalizing with respect to the class's intrinsic features. We dub this plug-and-play approach Class Subspace Orthogonalization (CSO) and assess it against challenging mixed-label and adaptive attacks.
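The intuition behind suppressing intrinsic features can be illustrated with a small toy example. The sketch below is not the paper's formulation (which uses per-class masks inside existing detectors). It assumes a hypothetical linear score per class, estimates each class's intrinsic subspace from a few clean feature vectors with an SVD, and maximizes the score over a norm-bounded perturbation constrained to be orthogonal to that subspace. All names (`intrinsic_basis`, `max_statistic`, the weight vectors, the feature dimension) are illustrative assumptions.

```python
import numpy as np

def intrinsic_basis(clean_feats, k):
    """Top-k right singular vectors of clean features (rows = samples).

    Assumption: the class's intrinsic features span a low-dimensional subspace.
    """
    _, _, vt = np.linalg.svd(clean_feats, full_matrices=False)
    return vt[:k].T  # shape (d, k), orthonormal columns

def max_statistic(w, eps, basis=None):
    """Maximum of w @ delta over ||delta||_2 <= eps.

    If `basis` is given, delta is also constrained to be orthogonal to
    span(basis); the optimum is then eps * ||P w||, with P the projector
    onto the orthogonal complement of that span.
    """
    if basis is not None:
        w = w - basis @ (basis.T @ w)  # remove the intrinsic component
    return eps * np.linalg.norm(w)

d, eps = 6, 1.0
e = np.eye(d)
rng = np.random.default_rng(0)

# Hypothetical linear class scores:
#   class 0 (non-target): strong intrinsic feature along e[0]
#   class 1 (target): weak intrinsic feature e[1] plus a subtle trigger along e[5]
w = {0: 3.0 * e[0], 1: e[1] + e[5]}

# A few clean (trigger-free) feature vectors per class, with small noise.
clean = {
    c: np.outer(rng.uniform(1, 2, 20), e[c]) + 0.01 * rng.normal(size=(20, d))
    for c in w
}

plain = {c: max_statistic(w[c], eps) for c in w}
cso = {c: max_statistic(w[c], eps, intrinsic_basis(clean[c], k=1)) for c in w}

print("unconstrained:", plain)
print("orthogonalized:", cso)
```

In this toy setup the unconstrained statistic is larger for the non-target class, which mirrors failure mode (1). Once the perturbation is forced to be orthogonal to each class's intrinsic subspace, the non-target statistic collapses toward zero while the target class retains the contribution of the trigger direction.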
