UltraClean: A Simple Framework to Train Robust Neural Networks against Backdoor Attacks

Bingyin Zhao; Yingjie Lao


TL;DR

UltraClean addresses the challenge of backdoor attacks in image classification, including both dirty-label and clean-label variants, by framing defense as a dataset cleansing problem. It employs two off-the-shelf denoisers (non-local mean and local median) to generate variants of each training image and leverages the error amplification effect in a pre-trained model to compute a susceptibility score; poisoned samples are removed before retraining. The approach delivers high poison-detection rates and substantial reductions in backdoor success rates with minimal degradation to clean accuracy, outperforming state-of-the-art defenses across diverse datasets and attack types. The method is simple to implement, effective across a range of attacks, and comes with code to facilitate adoption in practice.
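As a rough illustration of the local median denoiser mentioned above, the sketch below applies a 3x3 median filter to a toy grayscale image and measures how much the image changes. It is a minimal sketch, not the authors' implementation: the names `median_denoise` and `l1_residual`, the `radius` parameter, the 5x5 image, and the single bright pixel standing in for a trigger are all assumptions made for illustration.

```python
from statistics import median


def median_denoise(image, radius=1):
    """Local median filter on a 2D grayscale image given as a list of rows.

    Border pixels use only the neighbours that fall inside the image.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(height, r + radius + 1))
                for cc in range(max(0, c - radius), min(width, c + radius + 1))
            ]
            out[r][c] = median(window)
    return out


def l1_residual(image_a, image_b):
    """Sum of absolute pixel differences between two same-sized images."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(image_a, image_b)
        for a, b in zip(row_a, row_b)
    )


# Hypothetical toy data: a flat background, plus a copy with one bright pixel
# that stands in for a trigger-like perturbation.
clean = [[0.2] * 5 for _ in range(5)]
poisoned = [row[:] for row in clean]
poisoned[2][2] = 1.0

denoised = median_denoise(poisoned)
residual_poisoned = l1_residual(poisoned, denoised)
residual_clean = l1_residual(clean, median_denoise(clean))
```

In this toy case the filter leaves the flat image unchanged and removes the bright pixel from the perturbed copy, so the residual is zero for the clean image and nonzero for the perturbed one. Real triggers, especially clean-label ones, are far less conspicuous, which is why the framework measures the difference at the model output rather than in pixel space.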

Abstract

Backdoor attacks are an emerging threat to deep neural networks: they typically embed malicious behaviors into a victim model by injecting poisoned samples. Adversaries can activate the injected backdoor during inference by presenting the trigger on input images. Prior defensive methods have achieved remarkable success in countering dirty-label backdoor attacks, where poisoned samples are often mislabeled. However, these approaches do not work for a more recent type of backdoor, the clean-label backdoor attack, which imperceptibly modifies poisoned data while keeping labels consistent. Defending against such stealthy attacks demands more complex and powerful algorithms. In this paper, we propose UltraClean, a general framework that simplifies the identification of poisoned samples and defends against both dirty-label and clean-label backdoor attacks. Because backdoor triggers introduce adversarial noise that intensifies in feed-forward propagation, UltraClean first generates two variants of each training sample using off-the-shelf denoising functions. It then measures the susceptibility of training samples by leveraging the error amplification effect in DNNs, which magnifies the noise difference between the original image and its denoised variants. Lastly, it filters out poisoned samples based on their susceptibility to thwart the backdoor implantation. Despite its simplicity, UltraClean achieves a superior detection rate across various datasets and significantly reduces the backdoor attack success rate while maintaining decent model accuracy on clean data, outperforming existing defensive methods by a large margin. Code is available at https://github.com/bxz9200/UltraClean.
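The scoring and filtering steps described in the abstract can be sketched as follows. This is a hedged illustration, not the released UltraClean code: the score definition (summed L1 distance between model outputs on a sample and on each denoised variant), the fixed removal fraction, the 1D signals standing in for images, the `toy_model` classifier, and the plain local mean used in place of a non-local mean denoiser are all assumptions introduced here.

```python
import math
from statistics import median


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(signal):
    """Hypothetical stand-in for a pre-trained classifier (two-class softmax).

    Class 1 reacts strongly to large values, mimicking trigger sensitivity.
    """
    return softmax([1.0, 10.0 * max(signal) - 4.0])


def median_filter(signal, radius=1):
    """1D local median; windows are truncated at the borders."""
    return [
        median(signal[max(0, i - radius): i + radius + 1])
        for i in range(len(signal))
    ]


def mean_filter(signal, radius=1):
    """1D local mean, a simple stand-in for a non-local mean denoiser."""
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def susceptibility(sample, model, denoisers):
    """Assumed score: summed L1 distance between the model output on the
    sample and its output on each denoised variant."""
    reference = model(sample)
    score = 0.0
    for denoise in denoisers:
        variant_output = model(denoise(sample))
        score += sum(abs(a - b) for a, b in zip(reference, variant_output))
    return score


def cleanse(dataset, model, denoisers, remove_fraction):
    """Drop the highest-scoring fraction of (sample, label) pairs."""
    scores = [susceptibility(x, model, denoisers) for x, _ in dataset]
    n_remove = int(round(remove_fraction * len(dataset)))
    ranked = sorted(range(len(dataset)), key=lambda i: scores[i], reverse=True)
    removed = set(ranked[:n_remove])
    kept = [pair for i, pair in enumerate(dataset) if i not in removed]
    return kept, sorted(removed), scores


# Hypothetical toy dataset: three flat signals and one with a spike at index 3.
dataset = [
    ([0.2] * 8, 0),
    ([0.3] * 8, 0),
    ([0.2, 0.2, 0.2, 1.0, 0.2, 0.2, 0.2, 0.2], 0),
    ([0.25] * 8, 0),
]
kept, removed, scores = cleanse(
    dataset, toy_model, [median_filter, mean_filter], remove_fraction=0.25
)
```

Here only the spiked signal changes the toy model's output after denoising, so it receives the highest score and is the one sample removed. The toy model shows only the mechanics of scoring and ranking; it does not reproduce the amplification of noise through the layers of a deep network, and the removal fraction in practice would have to be chosen for the dataset at hand.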


Paper Structure (32 sections, 17 equations, 5 figures, 24 tables, 1 algorithm)


Introduction
Related Work
Backdoor Attacks on DNN
Defenses
UltraClean
Threat Model
UltraClean Framework
Methodology
Experiments
Experiment Settings
Evaluation on Dirty-Label Attacks
Evaluation on Clean-Label Attacks
Detection on the Poisoned Class
Detection on the Whole Training Dataset
Performance on Clean Datasets
...and 17 more sections

Figures (5)

Figure 1: Illustration of dirty-label and clean-label attacks (Top row: poisoned training samples; Bottom row: backdoored test samples. Red: incorrect labels; Green: correct labels). Dirty-label poisoned samples always possess incorrect labels, while clean-label poisoned samples are imperceptibly different from benign samples and possess correct labels.
Figure 2: Spatial and frequency domain views of two backdoor attacks. The top two rows show the dirty-label attack (BadNets; Gu et al., 2017); the bottom two rows show the clean-label attack (Hidden Trigger Backdoor; Saha et al., 2020). As shown in the 4th and 8th columns, poisoned samples show a substantial qualitative noise difference from their clean counterparts. This noise difference in pixel space is amplified during feed-forward propagation in a deep neural network and becomes a strong indicator for differentiating poisoned from benign samples.
Figure 3: STRIP against SIG backdoor (left), LCBD (middle) and HTBD (right). It fails to detect backdoor samples crafted by SIG and HTBD.
Figure 4: ASR comparison of using single "median" baseline images, single "mean" baseline images and aggregation of median and mean (SIG). The aggregation of median and mean (red curves) achieves lower post-clean ASR in most cases, demonstrating a better backdoor mitigation capability over single "median" (blue curves) and single "mean" (yellow curves).
Figure 5: ASR comparison of using the single "median" baseline images, single "mean" baseline images and aggregation of median and mean (HTBD).
