SSL-Cleanse: Trojan Detection and Mitigation in Self-Supervised Learning

Mengxin Zheng, Jiaqi Xue, Zihao Wang, Xun Chen, Qian Lou, Lei Jiang, Xiaofeng Wang

TL;DR

The paper tackles Trojan backdoors in self-supervised learning (SSL) encoders. Such backdoors can be inherited by downstream classifiers and are hard to detect at the encoder level because downstream tasks may be unknown, labels may be unavailable, and the original training data may be inaccessible. The paper introduces SSL-Cleanse, a two-part defense. The Detector uses Sliding Window Kneedle (SWK) to estimate the number of clusters automatically, Representation-Oriented Trigger Reverse (ROTR) to synthesize one candidate trigger per cluster, and the Size-Norm Trigger Outlier Detector (STOD) to flag Trojaned encoders. The Mitigator employs Self-supervised Clustering Unlearning (SCU) to deactivate backdoors while preserving normal performance. The approach requires neither downstream labels nor access to the poisoned dataset. It is evaluated on 1200 encoders under the SSL-Backdoor and CTRL attacks across BYOL, SimCLR, and MoCo V2, achieving an average detection success rate of 82.2% on ImageNet-100 and reducing the average attack success rate of backdoored encoders to 0.3% without great accuracy loss. The source code is publicly available.

Abstract

Self-supervised learning (SSL) is a prevalent approach for encoding data representations. Using a pre-trained SSL image encoder and subsequently training a downstream classifier, impressive performance can be achieved on various tasks with very little labeled data. The growing adoption of SSL has led to an increase in security research on SSL encoders and associated Trojan attacks. Trojan attacks embedded in SSL encoders can operate covertly, spreading across multiple users and devices. The presence of backdoor behavior in Trojaned encoders can inadvertently be inherited by downstream classifiers, making it even more difficult to detect and mitigate the threat. Although current Trojan detection methods in supervised learning can potentially safeguard SSL downstream classifiers, identifying and addressing triggers in the SSL encoder before its widespread dissemination is a challenging task. This challenge arises because downstream tasks might be unknown, dataset labels may be unavailable, and the original unlabeled training dataset might be inaccessible during Trojan detection in SSL encoders. We introduce SSL-Cleanse as a solution to identify and mitigate backdoor threats in SSL encoders. We evaluated SSL-Cleanse on various datasets using 1200 encoders, achieving an average detection success rate of 82.2% on ImageNet-100. After mitigating backdoors, on average, backdoored encoders achieve 0.3% attack success rate without great accuracy loss, proving the effectiveness of SSL-Cleanse. The source code of SSL-Cleanse is available at https://github.com/UCF-ML-Research/SSL-Cleanse.

Paper Structure

This paper contains 8 sections, 2 equations, 6 figures, 5 tables, 3 algorithms.

Figures (6)

  • Figure 1: The overview of SSL-Cleanse. SSL-Cleanse has two components, Detector and Mitigator, aiming to remove the malicious behavior of Trojaned SSL encoders.
  • Figure 2: The workflow of the SSL-Cleanse detector. Step 1: Unlabeled data samples are processed through the SSL encoder to compute their representations. The SWK algorithm then processes these representations to determine the number of clusters. Step 2: K-Means, run on the representations with the derived cluster number (K), establishes K clusters. The Representation-Oriented Trigger Reverse algorithm is then employed to generate K trigger patterns. Step 3: Each of the K triggers is assessed to determine whether it is an outlier in terms of its size and norm. An identified outlier indicates that the encoder is Trojaned.
  • Figure 3: Comparison of our SWK method and direct search (Kneedle) method on ImageNet-100 dataset. Our SWK method yields more stable and accurate K.
  • Figure 4: Illustration of Self-supervised Clustering Unlearning (SCU). The image $x$ is sampled from a cluster distinct from the cluster producing trigger $t$.
  • Figure 5: A comparison of detection accuracy between SSL-Cleanse using the SWK method and the direct search (Kneedle) method on ImageNet-100.
  • ...and 1 more figure
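
To make Steps 1 and 3 of the detector workflow in Figure 2 more concrete, here is a minimal Python sketch. It is not the authors' implementation. `knee_index` is a simplified knee heuristic (the point farthest below the chord of the curve), standing in for the Sliding Window Kneedle algorithm, whose details are not given here. `flag_small_outliers` assumes a median-absolute-deviation anomaly index, an assumed threshold of 2.0, and the assumption that a Trojaned cluster yields an unusually small trigger norm. All function names and sample values are hypothetical.

```python
from statistics import median


def knee_index(ys):
    """Return the index of the knee of a decreasing curve.

    Simplified heuristic (an assumption, not the paper's SWK): normalize
    both axes to [0, 1] and pick the point farthest below the straight
    chord joining the first and last points.
    """
    n = len(ys)
    if n < 3:
        raise ValueError("need at least three points")
    span = ys[0] - ys[-1]
    if span == 0:
        return 0  # flat curve: no knee to find
    best_i, best_gap = 0, float("-inf")
    for i, y in enumerate(ys):
        x_norm = i / (n - 1)
        y_norm = (y - ys[-1]) / span  # 1 at the start, 0 at the end
        gap = (1.0 - x_norm) - y_norm  # vertical distance below the chord
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


def flag_small_outliers(values, threshold=2.0):
    """Return indices of values that are unusually small.

    Uses a median-absolute-deviation (MAD) anomaly index; both the
    statistic and the threshold are assumptions for illustration.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    scale = 1.4826 * mad  # makes MAD comparable to a standard deviation
    if scale == 0:
        return []
    return [i for i, v in enumerate(values) if (med - v) / scale > threshold]


# Hypothetical clustering cost for K = 2..9 (Step 1).
costs = [100, 60, 30, 12, 10, 9, 8.5, 8]
estimated_k = 2 + knee_index(costs)

# Hypothetical norms of the K reversed triggers (Step 3).
trigger_norms = [50.0, 48.0, 52.0, 51.0, 6.0, 49.0]
suspicious = flag_small_outliers(trigger_norms)
encoder_is_trojaned = bool(suspicious)
```

In this toy run the cost curve flattens after its fourth point and one trigger norm sits far below the rest, so that cluster is flagged. A real detector would also score trigger size, as the Figure 2 caption describes.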