Backdoor Defense through Self-Supervised and Generative Learning

Ivan Sabolić, Ivan Grubišić, Siniša Šegvić

TL;DR

This paper explores generative modelling of per-class distributions in a self-supervised representation space and finds that per-class generative models make it possible to detect poisoned data and cleanse the dataset.

Abstract

Backdoor attacks change a small portion of training data by introducing hand-crafted triggers and rewiring the corresponding labels towards a desired target class. Training on such data injects a backdoor which causes malicious inference in selected test samples. Most defenses mitigate such attacks through various modifications of the discriminative learning procedure. In contrast, this paper explores an approach based on generative modelling of per-class distributions in a self-supervised representation space. Interestingly, these representations are either preserved or heavily disturbed under recent backdoor attacks. In both cases, we find that per-class generative models make it possible to detect poisoned data and cleanse the dataset. Experiments show that training on the cleansed dataset greatly reduces the attack success rate and retains the accuracy on benign inputs.

Paper Structure

This paper contains 56 sections, 9 equations, 4 figures, 17 tables, 1 algorithm.

Figures (4)

  • Figure 1: 2D UMAP visualization of the self-supervised feature space for CIFAR-10. Poisoned samples are shown in black, while clean samples are shown in colour. The target class (airplane) is shown in brown. Non-disruptive attacks (left, gu2019badnets) have very little influence on the self-supervised embeddings. Disruptive attacks (right, turner2019label) displace the poisoned samples from the manifold of the training data.
  • Figure 2: Distributions of the maximum foreign density $v_y(\bm{z})$ of clean and target classes in the presence of the Label-Consistent attack (turner2019label) on CIFAR-10 and a strong BadNets attack on ImageNet-30. In contrast to clean classes, target classes exhibit strong bimodality because poisoned samples tend to cluster near $0$. Note that both attacks are disruptive.
  • Figure 3: The values of the poisoning score (\ref{eq:score}) for all samples within one target class. Clean samples are shown in blue and poisoned samples in red.
  • Figure F.1: Histogram of $L^2$ distances between self-supervised embeddings for the BadNets attack on the CIFAR-10 dataset. Distances between clean and poisoned versions of the same example are shown in brown. Blue denotes distances between samples of the same class (intra-class), while green marks distances between samples from different classes (inter-class).
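
The quantity plotted in Figure 2 can be illustrated with a small sketch. It rests on several assumptions that are not confirmed by the text above: that $v_y(\bm{z})$ is the largest density assigned to an embedding by the generative model of any class other than its label $y$, that a diagonal Gaussian is an acceptable stand-in for the per-class generative model, and that synthetic 2D points can stand in for self-supervised embeddings. All function and variable names are hypothetical. The sketch works with log-densities, so densities near $0$ appear as strongly negative values.

```python
import numpy as np

def fit_class_gaussians(z, y):
    """Fit one diagonal Gaussian per label (a stand-in generative model)."""
    models = {}
    for c in np.unique(y):
        zc = z[y == c]
        # Small variance floor keeps the log-density finite.
        models[int(c)] = (zc.mean(axis=0), zc.var(axis=0) + 1e-6)
    return models

def log_density(z, mean, var):
    """Log-density of a diagonal Gaussian at each row of z."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mean) ** 2 / var, axis=1)

def max_foreign_log_density(z, y, models):
    """Largest log-density under any class other than the sample's own label."""
    classes = sorted(models)
    logp = np.stack([log_density(z, *models[c]) for c in classes], axis=1)
    own = np.array(classes)[None, :] == y[:, None]
    return np.where(own, -np.inf, logp).max(axis=1)

# Synthetic stand-in for embeddings: three clean clusters plus a group of
# "disruptively" poisoned points pushed off the data manifold and labelled
# with the target class 0.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
z_clean = np.concatenate([rng.normal(c, 1.0, size=(200, 2)) for c in centers])
y_clean = np.repeat(np.arange(3), 200)
z_poison = rng.normal([20.0, 20.0], 0.5, size=(20, 2))
y_poison = np.zeros(20, dtype=int)

z = np.concatenate([z_clean, z_poison])
y = np.concatenate([y_clean, y_poison])
is_poisoned = np.concatenate([np.zeros(600, dtype=bool), np.ones(20, dtype=bool)])

models = fit_class_gaussians(z, y)
v = max_foreign_log_density(z, y, models)

# Within the target class, the displaced points receive far lower foreign
# density than the clean points, giving two separated modes.
target = y == 0
print("clean    median:", np.median(v[target & ~is_poisoned]))
print("poisoned median:", np.median(v[target & is_poisoned]))
```

In this toy setting the two groups inside the target class separate cleanly, which mirrors the bimodality described in the caption for disruptive attacks; it does not cover non-disruptive attacks, where the embeddings are largely preserved.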