Invisible Backdoor Attack against Self-supervised Learning

Hanrong Zhang, Zhenting Wang, Boheng Li, Fulin Lin, Tingxu Han, Mingyu Jin, Chenlu Zhan, Mengnan Du, Hongwei Wang, Shiqing Ma

TL;DR

Self-supervised learning (SSL) models are vulnerable to backdoor attacks, but invisible triggers designed for supervised learning underperform in SSL because their backdoor samples become entangled with SSL's augmented samples. The authors propose INACTIVE, an imperceptible backdoor that disentangles the trigger from SSL augmentations by operating in HSV/HSL color spaces and learning under alignment and stealth constraints. They report near-perfect attack success rates across five datasets and six SSL algorithms, with strong stealth metrics and robustness to existing defenses. The work highlights a substantial security risk in SSL pipelines and suggests a need for defenses that account for distributional disentanglement and perceptual stealth. Code is released for reproducibility.

Abstract

Self-supervised learning (SSL) models are vulnerable to backdoor attacks. Existing backdoor attacks that are effective in SSL often rely on noticeable triggers, such as colored patches or visible noise, which are easily caught by human inspection. This paper proposes an imperceptible and effective backdoor attack against self-supervised models. We first find that existing imperceptible triggers designed for supervised learning are less effective at compromising self-supervised models. We then attribute this ineffectiveness to the overlap between the distributions of backdoor samples and the augmented samples used in SSL. Building on this insight, we design an attack that uses optimized triggers disentangled from the augmentation transformations in SSL while remaining imperceptible to human vision. Experiments on five datasets and six SSL algorithms demonstrate that our attack is highly effective and stealthy. It is also strongly resistant to existing backdoor defenses. Our code can be found at https://github.com/Zhang-Henry/INACTIVE.
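
To make the idea of a color-space trigger concrete, the following is a minimal toy sketch, not the INACTIVE method. It applies a fixed, hand-picked offset in HSV space to every pixel and measures how far the result drifts from the original using PSNR. The abstract describes optimized triggers, whereas the offsets here (`dh`, `ds`, `dv`) are hypothetical constants. The function names and the choice of PSNR as a stealth proxy are likewise assumptions made for illustration only.

```python
import colorsys
import math


def shift_hsv(pixel, dh=0.0, ds=0.0, dv=0.0):
    """Apply a fixed HSV offset to one RGB pixel (floats in [0, 1])."""
    h, s, v = colorsys.rgb_to_hsv(*pixel)
    h = (h + dh) % 1.0                # hue is circular, so wrap around
    s = min(max(s + ds, 0.0), 1.0)    # clamp saturation to [0, 1]
    v = min(max(v + dv, 0.0), 1.0)    # clamp value to [0, 1]
    return colorsys.hsv_to_rgb(h, s, v)


def apply_trigger(image, dh, ds, dv):
    """Apply the same hypothetical HSV offset to every pixel of an image.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    """
    return [[shift_hsv(p, dh, ds, dv) for p in row] for row in image]


def psnr(a, b):
    """Peak signal-to-noise ratio in dB between two images in [0, 1]."""
    squared_error = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pix_a, pix_b in zip(row_a, row_b):
            for ch_a, ch_b in zip(pix_a, pix_b):
                squared_error += (ch_a - ch_b) ** 2
                count += 1
    mse = squared_error / count
    return math.inf if mse == 0 else 10 * math.log10(1.0 / mse)


# A tiny 2x2 "image" used only to exercise the functions.
image = [
    [(0.2, 0.4, 0.6), (0.9, 0.1, 0.3)],
    [(0.5, 0.5, 0.5), (0.0, 1.0, 0.0)],
]
subtle = apply_trigger(image, dh=0.01, ds=0.02, dv=0.0)
obvious = apply_trigger(image, dh=0.10, ds=0.20, dv=0.0)
print("PSNR, subtle shift: ", psnr(image, subtle))
print("PSNR, obvious shift:", psnr(image, obvious))
```

A smaller offset yields a higher PSNR, meaning less visible change. This only shows the stealth side of the trade-off. It says nothing about attack success, which according to the abstract depends on the trigger being disentangled from the SSL augmentations.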

Paper Structure (34 sections, 1 theorem, 19 equations, 19 figures, 19 tables, 2 algorithms)

This paper contains 34 sections, 1 theorem, 19 equations, 19 figures, 19 tables, 2 algorithms.

Key Result

Theorem 3.1

Given a perfectly trained encoder $\mathcal{F}_\theta$, trained on augmentations sampled from a predefined augmentation space $\mathcal{S}_\mathcal{A}$, it is impossible to inject a backdoor with a trigger function $\mathcal{I} \in \mathcal{S}_\mathcal{A}$.
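
One informal way to read the theorem, offered as a sketch rather than the paper's proof, is to assume that "perfectly trained" means the encoder's output is invariant under every augmentation in $\mathcal{S}_\mathcal{A}$. The symbols $a$ and $x$ below are introduced only for this illustration.

```latex
\begin{align*}
&\text{Assumed invariance:} &
  \mathcal{F}_\theta\bigl(a(x)\bigr) &= \mathcal{F}_\theta(x)
  \quad \forall\, a \in \mathcal{S}_\mathcal{A},\ \forall\, x, \\
&\text{Trigger inside the space:} &
  \mathcal{I} \in \mathcal{S}_\mathcal{A}
  \;&\Longrightarrow\;
  \mathcal{F}_\theta\bigl(\mathcal{I}(x)\bigr) = \mathcal{F}_\theta(x).
\end{align*}
```

Under that assumption, a triggered input has the same embedding as its clean counterpart, so nothing built on top of the embedding can respond to the trigger. This is consistent with the abstract's observation that triggers overlapping with the augmentation distribution are less effective in SSL.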

Figures (19)

  • Figure 1: Comparison of clean samples and backdoored samples created by the patch trigger used by BadEncoder [Jia_Liu_Gong_2022] and DRUPE [tao2023distribution], the Instagram filter trigger [pilgram], the ISSBA trigger [li2021invisible], the WaNet trigger [nguyen2021wanet], and ours. Except for DRUPE, the ASRs are tested under the threat model of BadEncoder. Residuals are the difference between clean and backdoored images. Our method achieves the highest ASR while keeping the trigger stealthy, whereas other methods either have a much lower ASR or use more easily detectable triggers.
  • Figure 2: Existing imperceptible backdoor triggers, which yield high ASR in supervised learning (SL), are not as effective in SSL. The attack frameworks for SL and SSL are standard backdoor poisoning [gu2017badnets] and BadEncoder [Jia_Liu_Gong_2022], respectively.
  • Figure 3: t-SNE visualization of the feature space for the inherent augmentation space and the backdoor trigger space. The SimCLR [chen2020simple] pre-trained model struggles to differentiate between backdoor samples injected with the WaNet trigger [nguyen2021wanet] and the augmented samples within the SimCLR contrastive learning framework.
  • Figure A1: Beatrix's deviation distribution [ma2022beatrix] for clean and backdoored data, where blue bars represent clean samples and orange bars indicate poisoned ones. Poisoned samples from BadEncoder display markedly higher deviations than clean samples, allowing Beatrix to detect them effectively. However, the deviation distributions for clean inputs and poisoned samples from INACTIVE overlap significantly, making them hard to tell apart.
  • Figure A2: Resilience to Grad-CAM [Selvaraju_2017_ICCV]. The resemblance between these heatmaps indicates that INACTIVE can resist defenses that rely on Grad-CAM.
  • ...and 14 more figures

Theorems & Definitions (2)

  • Theorem 3.1
  • proof