Model Pairing Using Embedding Translation for Backdoor Attack Detection on Open-Set Classification Tasks

Alexander Unnervik; Hatef Otroshi Shahreza; Anjith George; Sébastien Marcel

TL;DR

This paper proposes pairing models to detect backdoors in open-set classification tasks. A simple linear operation projects embeddings from a probe model's embedding space into a reference model's embedding space, and the similarity score between the projected and reference embeddings can indicate the presence of a backdoor.

Abstract

Backdoor attacks allow an attacker to embed a specific vulnerability in a machine learning algorithm, which is activated when an attacker-chosen pattern is presented and causes a specific misprediction. The need to identify backdoors in biometric scenarios has led us to propose a novel technique with different trade-offs. In this paper, we propose to use model pairs on open-set classification tasks for detecting backdoors. Using a simple linear operation to project embeddings from a probe model's embedding space to a reference model's embedding space, we can compare both embeddings and compute a similarity score. We show that this score can be an indicator for the presence of a backdoor, even though the models have different architectures and were trained independently on different datasets. This technique allows for the detection of backdoors in models designed for open-set classification tasks, a setting that has received little attention in the literature. Additionally, we show that backdoors can be detected even when both models are backdoored. The source code is made available for reproducibility purposes.
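The pairing idea in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: the linear translator is fitted here by ordinary least squares, the embeddings are synthetic stand-ins rather than outputs of real face models, and the backdoor is simulated by replacing the probe embedding with that of a victim sample. The names `fit_translator` and `cosine_scores` are hypothetical.

```python
import numpy as np

def fit_translator(e_prb, e_ref):
    """Fit a linear map W so that e_prb @ W approximates e_ref.

    Assumption: ordinary least squares; the paper's training procedure
    for the embedding translator may differ.
    """
    W, *_ = np.linalg.lstsq(e_prb, e_ref, rcond=None)
    return W

def cosine_scores(e_trs, e_ref):
    """Row-wise cosine similarity between translated and reference embeddings."""
    num = np.sum(e_trs * e_ref, axis=1)
    den = np.linalg.norm(e_trs, axis=1) * np.linalg.norm(e_ref, axis=1)
    return num / den

# Synthetic stand-ins for the two embedding spaces (not real model outputs):
# both are linear views of the same latent identity representation.
rng = np.random.default_rng(0)
n, d_prb, d_ref = 500, 16, 8
latent = rng.normal(size=(n, d_prb))
e_prb = latent @ rng.normal(size=(d_prb, d_prb))  # probe model embeddings
e_ref = latent @ rng.normal(size=(d_prb, d_ref))  # reference model embeddings

# Fit the translator on the first 400 samples, evaluate on the remaining 100.
W = fit_translator(e_prb[:400], e_ref[:400])
clean = cosine_scores(e_prb[400:] @ W, e_ref[400:])

# Simulated backdoor: the probe model maps every held-out sample to the
# embedding of a victim sample (index 0), while the reference model does not.
victim = np.repeat(e_prb[:1], 100, axis=0)
poisoned = cosine_scores(victim @ W, e_ref[400:])

print("mean clean score:   ", clean.mean())
print("mean poisoned score:", poisoned.mean())
```

In this toy setting the two models agree on clean inputs after translation, so the scores are high, whereas the simulated backdoor makes the probe embedding disagree with the reference embedding and the score drops.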

Paper Structure (28 sections, 11 equations, 4 figures, 5 tables)

Introduction
Proposed Method
Threat model
Backdoor Attack Detection via Model Pairing
The embedding translator
The score
Experiments
Experimental setup
Analysis
Training the backdoored networks
Detection metrics on model pairs
Discussion
Conclusion
Background and related work
Backdoor attacks in open-set classification
...and 13 more sections

Figures (4)

Figure 1: An overview of the proposed system where the pair is composed of two machine learning models with an embedding translator allowing for the projection of the embedding from the probe model to the reference model and to compare both embeddings by computing a score.
Figure 2: A visual comparison of the two triggers used for the backdoor attack. Left: the large checkerboard trigger. Right: the small square trigger.
Figure 3: The cosine similarity scores from the FFHQ validation set for genuines and zero-effort impostors (ZEI) on four different model pairs, with samples poisoned with the small trigger from CASIA-WebFace. Ideally, for clean model pairs (Figures \ref{fig:val_scores_clean_fn2if} and \ref{fig:val_scores_clean_if2fn}), the distribution of poisoned attacker samples (red) should overlap with the distribution of genuine samples (green) as much as possible, whereas for ideal backdoored model pairs (Figures \ref{fig:val_scores_bd_fn2if} and \ref{fig:val_scores_bd_if2fn}), the distribution of poisoned attacker samples (red) should overlap with the distribution of ZEI (blue) as much as possible.
Figure 4: A t-SNE plot of the embeddings from two model pairs with InsightFace as $M_{ref}$ and FaceNet (backdoored using the small trigger) as $M_{prb}$ with various clean and poisoned samples. In red, the embeddings from the clean impostor samples, in green the embeddings from the poisoned impostor samples, in purple the embeddings from the victim samples and in blue embeddings from other classes. The $\mathbf{e}_{trs}$ are the crosses and $\mathbf{e}_{ref}$ are the circles. Notice how for $\mathbf{e}_{trs}$ the samples from the impostor class, with trigger, approach the victim class cluster and distance themselves from the clean impostor cluster. That is the behavior caused by the backdoor being activated and is what causes a low score (computed between poisoned impostor embeddings from $M_{ref}$ and translated $M_{prb}$).
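The separation between score distributions described in the Figure 3 caption suggests a simple decision rule, sketched below. This is only one possible rule and an assumption on our part: the threshold is set at a low quantile of genuine scores, and any sample scoring below it is flagged. The function name `flag_low_scores`, the quantile value, and all score values are hypothetical illustrations, not results or metrics from the paper.

```python
import numpy as np

def flag_low_scores(scores, genuine_scores, quantile=0.05):
    """Flag scores that fall below a low quantile of the genuine scores.

    Assumption: a quantile-based threshold; the paper's detection
    metrics are not described in this excerpt.
    """
    threshold = np.quantile(genuine_scores, quantile)
    return scores < threshold, threshold

# Made-up illustrative scores (not measurements from the paper).
genuine = np.array([0.91, 0.88, 0.93, 0.90, 0.87, 0.92, 0.89, 0.94])
suspect = np.array([0.90, 0.12, 0.05, 0.91])

flags, thr = flag_low_scores(suspect, genuine)
print("threshold:", thr)
print("flagged:  ", flags)
```

A lower quantile reduces false alarms on genuine samples at the cost of missing backdoor activations whose scores stay close to the genuine range.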
