Mutual Information Guided Backdoor Mitigation for Pre-trained Encoders

Tingxu Han; Weisong Sun; Ziqi Ding; Chunrong Fang; Hanwei Qian; Jiaxun Li; Zhenyu Chen; Xiangyu Zhang

TL;DR

This work tackles backdoor threats in self-supervised pre-trained encoders by introducing MIMIC, a two-phase defense that first uses mutual information to locate benign knowledge within a backdoored encoder and then distills this knowledge into a randomly initialized student network. MI-guided benign knowledge localization, combined with clone and attention losses, transfers clean features while suppressing malicious patterns, achieving substantial reductions in attack success rate with minimal impact on clean accuracy while using less than 5% of clean data. Across four datasets and two SSL backdoor attacks, MIMIC outperforms seven baselines, remains robust to varying trigger sizes, data fractions, and adaptive threats, and generalizes to supervised learning settings. The framework offers a practical, task-agnostic defense for SSL pipelines and highlights the role of mutual information in preserving benign representations during backdoor mitigation.

Abstract

Self-supervised learning (SSL) is increasingly attractive for pre-training encoders without requiring labeled data. Downstream tasks built on top of these pre-trained encoders can achieve nearly state-of-the-art performance. However, existing studies show that encoders pre-trained with SSL are vulnerable to backdoor attacks. Numerous backdoor mitigation techniques have been designed for downstream task models, but their effectiveness is limited when they are adapted to pre-trained encoders, because no label information is available during pre-training. To address backdoor attacks against pre-trained encoders, we propose a mutual information guided backdoor mitigation technique named MIMIC. MIMIC treats the potentially backdoored encoder as the teacher net and employs knowledge distillation to distill a clean student encoder from it. Unlike existing knowledge distillation approaches, MIMIC initializes the student with random weights, so the student inherits no backdoors from the teacher net. MIMIC then uses the mutual information between each layer and the extracted features to locate where benign knowledge lies in the teacher net, and applies distillation there to clone clean features from teacher to student. The distillation loss has two components, a clone loss and an attention loss, which together aim to mitigate backdoors while maintaining encoder performance. Our evaluation on two backdoor attacks in SSL demonstrates that MIMIC can significantly reduce the attack success rate using less than 5% of clean data, surpassing seven state-of-the-art backdoor mitigation techniques.
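To make the two-part distillation loss more concrete, the following is a minimal sketch rather than the authors' implementation. The abstract names a clone loss and an attention loss but does not give their formulas here, so the sketch assumes a mean squared error between teacher and student final features for the clone term, and a mean squared error between L2-normalized spatial attention maps (channel-wise sum of squared activations) for the attention term. The names `attention_map`, `distillation_loss`, `selected`, and `alpha` are hypothetical; `selected` stands in for the layers that the MI-guided step would pick.

```python
import math


def attention_map(feature):
    """Collapse a C x N activation (list of channels) into an L2-normalized spatial map.

    Assumption: attention is the channel-wise sum of squared activations.
    """
    n_positions = len(feature[0])
    amap = [sum(channel[i] ** 2 for channel in feature) for i in range(n_positions)]
    norm = math.sqrt(sum(v * v for v in amap)) or 1.0  # avoid division by zero
    return [v / norm for v in amap]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_layers, student_layers, teacher_z, student_z,
                      selected, alpha=1.0):
    """Clone loss on final features plus attention loss on the selected layers.

    Assumption: both terms are MSE-based and combined with a weight `alpha`.
    """
    clone = mse(teacher_z, student_z)
    attention = sum(
        mse(attention_map(teacher_layers[l]), attention_map(student_layers[l]))
        for l in selected
    )
    return clone + alpha * attention


# Toy activations: two layers, each with 2 channels and 3 spatial positions.
teacher = [[[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]], [[0.2, 0.8, 0.1], [1.0, 1.0, 0.0]]]
student = [[[0.9, 0.1, 1.8], [0.4, 0.6, 0.5]], [[0.0, 0.5, 0.3], [0.7, 1.2, 0.1]]]
loss = distillation_loss(teacher, student, [0.3, 0.7], [0.25, 0.6], selected=[1])
print(loss)
```

Restricting the attention term to `selected` mirrors the idea that only layers judged to carry benign knowledge guide the student; how those layers are chosen is left outside this sketch.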

Paper Structure (29 sections, 3 theorems, 12 equations, 14 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Background
Self-supervised learning
Backdoor attack on self-supervised learning
Backdoor defense on self-supervised learning
Mutual information
Motivation
Methodology
MI-Guided Benign Knowledge Localization
Distillation Training
Evaluation
Experimental Setup
RQ1: Evaluation Results on Effectiveness
RQ1.1: How effective is MIMIC in removing backdoors in SSL
...and 14 more sections

Key Result

Theorem 1

$I(\mathcal{F}_\theta^{0}(x),z)\leq I(\mathcal{F}_\theta^{1}(x),z)\leq\cdots\leq I(\mathcal{F}_\theta^{n-2}(x),z)\leq I(\mathcal{F}_\theta^{n-1}(x),z)$, where $z$ denotes the final features extracted by the pre-trained encoder and $\mathcal{F}_\theta^l(\cdot)$ denotes the output of the $l$-th layer.
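The ordering in Theorem 1 can be checked numerically on a toy model. The sketch below is illustrative only and is not the paper's encoder: it assumes each "layer" is a binary variable obtained by flipping the previous one with a fixed probability, so that the layers and the final feature $z$ form a Markov chain. The function names `mutual_information` and `layer_mi_with_final`, and the `flip` parameter, are hypothetical.

```python
import math
from itertools import product


def mutual_information(joint):
    """I(A;B) in bits from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


def layer_mi_with_final(flip, n_layers):
    """Return [I(h_0; z), ..., I(h_{n-1}; z)] for the toy chain h_0 -> ... -> h_{n-1} -> z.

    Assumption: h_0 is a uniform bit and every transition flips the bit
    with probability `flip` (a binary symmetric channel).
    """
    values = []
    for layer in range(n_layers):
        joint = {}
        # Enumerate every path (h_0, ..., h_{n-1}, z) and marginalize to (h_layer, z).
        for states in product((0, 1), repeat=n_layers + 1):
            p = 0.5
            for prev, cur in zip(states, states[1:]):
                p *= flip if prev != cur else 1.0 - flip
            key = (states[layer], states[-1])
            joint[key] = joint.get(key, 0.0) + p
        values.append(mutual_information(joint))
    return values


mi_per_layer = layer_mi_with_final(flip=0.1, n_layers=3)
print(mi_per_layer)
```

In this toy chain, layers closer to $z$ pass through fewer noisy transitions, so their mutual information with $z$ is larger, which matches the non-decreasing ordering stated in the theorem.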

Figures (14)

Figure 1: The outlines of MIMIC's two steps.
Figure 2: Backdoor attack against pre-trained encoders. First, an attacker injects backdoors into an encoder and releases the poisoned encoder online, e.g., on Hugging Face. Second, a user trains a classifier built on the backdoored encoder for a downstream task. During inference, the classifier built on the backdoored encoder has high accuracy on clean inputs but misclassifies inputs with the trigger as the attacker-chosen target.
Figure 3: The performance of teacher nets.
Figure 4: Mutual information guidance. It showcases the mutual information between each layer's output and the final latent extracted features.
Figure 5: Layer-wise outputs: With $\mathcal{X}$ as the input and $\mathcal{Z}$ as the final extracted features, the diagram highlights the data flow through each layer of the encoder. The classification linear probe is trained on $\mathcal{Z}$, utilizing the final extracted features for downstream tasks.
...and 9 more figures

Theorems & Definitions (4)

Theorem 1: Markovian property of benign features
Proposition 1: Reversed Markov chain
proof
Theorem 2: Data Processing Inequality
