TrojanDec: Data-free Detection of Trojan Inputs in Self-supervised Learning

Yupei Liu, Yanting Wang, Jinyuan Jia

TL;DR

Self-supervised encoders are vulnerable to trojan attacks: an input carrying the attacker's trigger can mislead downstream classifiers built on the encoder. The authors introduce TrojanDec, a data-free, test-time defense that detects and recovers trojaned test inputs in three steps: it extracts metadata through random patch-based masking, clusters that metadata using the gap statistic, and restores flagged inputs with diffusion-based denoising (DDNM). TrojanDec needs neither clean data nor training data and requires only black-box access to the encoder. The authors report strong detection and restoration across multiple attacks and real-world encoders, outperforming existing defenses in data-free settings. The approach offers practical protection for cloud-based SSL pipelines while preserving utility.
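
The masking step can be illustrated with a small sketch. It is not the authors' implementation: the patch size, the stride, the sliding-grid placement, and the `toy_encoder` (a hypothetical stand-in for a trojaned black-box encoder that reacts strongly to an all-white corner patch) are all assumptions made for illustration. The sketch overwrites one patch at a time with a random pattern, queries the encoder, and records the cosine similarity between each masked feature vector and the feature vector of the unmasked input.

```python
import numpy as np

def masked_copies(image, pattern, stride):
    """Return copies of `image`, each with one patch overwritten by `pattern`."""
    ph, pw = pattern.shape[:2]
    h, w = image.shape[:2]
    copies = []
    for top in range(0, h - ph + 1, stride):
        for left in range(0, w - pw + 1, stride):
            masked = image.copy()
            masked[top:top + ph, left:left + pw] = pattern
            copies.append(masked)
    return copies

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def similarity_metadata(encoder, image, patch=8, stride=8, seed=0):
    """Similarity of each masked copy's features to the original's features.

    Only encoder outputs are used (black-box access). `patch`, `stride`,
    and `seed` are illustrative defaults, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    pattern = rng.random((patch, patch) + image.shape[2:])  # random mask pattern
    reference = encoder(image)
    return np.array([cosine(encoder(m), reference)
                     for m in masked_copies(image, pattern, stride)])

def toy_encoder(image):
    """Hypothetical trojaned encoder: one feature fires on a white corner patch."""
    coarse = np.array([image.mean(), image[:16].mean(), image[:, :16].mean()])
    trigger_on = float(np.all(image[-8:, -8:] > 0.99))
    return np.concatenate([coarse, [10.0 * trigger_on]])

rng = np.random.default_rng(1)
clean = 0.9 * rng.random((32, 32))   # grayscale toy image, values below 0.9
trojaned = clean.copy()
trojaned[-8:, -8:] = 1.0             # stamp the toy trigger in the corner

clean_sims = similarity_metadata(toy_encoder, clean)
trojaned_sims = similarity_metadata(toy_encoder, trojaned)
```

In this toy setup every masked copy of the clean image stays close to the original in feature space, while the single masked copy that covers the trigger drops sharply in similarity. That outlier is the kind of signal a clustering step can pick up.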

Abstract

An image encoder pre-trained by self-supervised learning can be used as a general-purpose feature extractor to build classifiers for various downstream tasks. However, many studies have shown that an attacker can embed a trojan into an encoder such that multiple downstream classifiers built on the trojaned encoder simultaneously inherit the trojan behavior. In this work, we propose TrojanDec, the first data-free method to identify and recover a test input embedded with a trigger. Given a (trojaned or clean) encoder and a test input, TrojanDec first predicts whether the test input is trojaned. If it is not, the test input is processed normally to maintain utility. Otherwise, the test input is restored to remove the trigger. Our extensive evaluation shows that TrojanDec can effectively detect the trigger (if any) in a given test input and recover the input under state-of-the-art trojan attacks. We further demonstrate experimentally that TrojanDec outperforms state-of-the-art defenses.
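
The prediction step ("predicts whether the test input is trojaned") is summarized in the TL;DR as clustering with the gap statistic. The sketch below shows one way such a decision could look; it is an assumption-laden illustration, not the paper's algorithm. It applies the standard gap statistic to a one-dimensional array of scores (for example, per-mask similarity values), compares only one cluster against two, and uses an exact 1-D split in place of general k-means. The function names, `n_ref`, and the example arrays are hypothetical.

```python
import numpy as np

def within_ss(x):
    """Sum of squared deviations from the mean (dispersion of one cluster)."""
    return float(((x - x.mean()) ** 2).sum())

def best_two_split(x):
    """Exact 1-D 2-means: try every split point of the sorted values."""
    xs = np.sort(x)
    return min(within_ss(xs[:i]) + within_ss(xs[i:]) for i in range(1, len(xs)))

def log_dispersion(x, k):
    """log W_k for k in {1, 2}; the epsilon guards against log(0)."""
    w = within_ss(x) if k == 1 else best_two_split(x)
    return np.log(w + 1e-12)

def gap_says_two_clusters(x, n_ref=50, seed=0):
    """Return True if the gap statistic prefers two clusters over one.

    Reference sets are drawn uniformly over the range of `x`; one cluster
    is kept when Gap(1) >= Gap(2) - s_2 (the usual selection rule).
    """
    rng = np.random.default_rng(seed)
    refs = rng.uniform(x.min(), x.max(), size=(n_ref, len(x)))
    gap, s = {}, {}
    for k in (1, 2):
        ref_logs = np.array([log_dispersion(r, k) for r in refs])
        gap[k] = ref_logs.mean() - log_dispersion(x, k)
        s[k] = ref_logs.std() * np.sqrt(1 + 1 / n_ref)
    return bool(gap[1] < gap[2] - s[2])

# Hypothetical scores: one clear outlier versus an evenly spread set.
suspicious = np.array([1.0] * 15 + [0.08])
benign = np.linspace(0.98, 1.0, 16)
flag_suspicious = gap_says_two_clusters(suspicious)
flag_benign = gap_says_two_clusters(benign)
```

With these example arrays, the scores containing a single outlier are split into two clusters and flagged, while the evenly spread scores are treated as one cluster. A flagged input would then go to the restoration step; an unflagged one would be passed through unchanged.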

Paper Structure

This paper contains 26 sections, 1 theorem, 7 equations, 8 figures, 11 tables, 2 algorithms.

Key Result

Proposition 1

If the adversary sets the trojan trigger to be $\mathbf{e}$, whose height and width are $e_h$ and $e_w$, while the defender has a mask $(\mathbf{m}, \mathbf{p})$ whose pattern $\mathbf{p}$ is randomly generated, the probability of the existence of a part of the mask pattern such that $\ell_1$ dis

Figures (8)

  • Figure 1: An overview of our TrojanDec.
  • Figure 2: Cosine similarities of the feature vectors of masked images and the feature vector of the original trojaned test image. The trojaned encoder is pre-trained on CIFAR10 and the downstream dataset is STL10.
  • Figure 3: Example of an image from the STL10 dataset recovered from its masked prototype: (a) original, (b) trojaned, (c) masked, and (d) restored.
  • Figure 4: Comparison to Beatrix and Strip on STL10.
  • Figure 5: Comparison to Beatrix and Strip on SVHN.
  • ...and 3 more figures

Theorems & Definitions (1)

  • Proposition 1