TRIGS: Trojan Identification from Gradient-based Signatures

Mohamed E. Hussein; Sudharshan Subramaniam Janakiraman; Wael AbdAlmageed

TL;DR

TRIGS is a gradient-based, signature-driven approach to Trojan (backdoor) model detection that remains usable when the defender does not know the attacker's model architecture in advance. By optimizing multiple loss functions with respect to the model's input, TRIGS constructs a signature from the resulting activation optimization maps, optionally compresses it into a fixed-size representation via pixel-wise statistics, and trains a binary detector to identify Trojan models. The method achieves state-of-the-art AUC on public CIFAR10 and Tiny ImageNet datasets and performs best on a newly released ImageNet ViT dataset, even with limited clean data and under architecture mismatch. The work provides a practical, data-efficient defense framework and releases a challenging ViT-based Trojan dataset to support further research on backdoor defenses.

Abstract

Training machine learning models can be very expensive or even unaffordable. This may be due, for example, to data limitations, such as the data being unavailable or too large, or to limited computational power. Therefore, it is common practice to rely on open-source pre-trained models whenever possible. However, this practice is alarming from a security perspective. Pre-trained models can be infected with Trojan attacks, in which the attacker embeds a trigger in the model such that the attacker can control the model's behavior whenever the trigger is present in the input. In this paper, we present a novel method for detecting Trojan models. Our method creates a signature for a model based on activation optimization. A classifier is then trained to detect a Trojan model given its signature. We call our method TRIGS, for TRojan Identification from Gradient-based Signatures. TRIGS achieves state-of-the-art performance on two public datasets of convolutional models. Additionally, we introduce a new, challenging dataset of ImageNet models based on the vision transformer architecture. TRIGS delivers the best performance on the new dataset, surpassing the baseline methods by a large margin. Our experiments also show that TRIGS requires only a small number of clean samples to achieve good performance, and works reasonably well even if the defender does not have prior knowledge of the attacker's model architecture. Our code and data are publicly available.
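As a rough illustration of how a signature could be built from activation optimization, the sketch below runs gradient descent on the input of a toy linear classifier, producing one minimization map and one maximization map per class. Everything specific here is an assumption made for illustration and is not taken from the paper: the linear model, the loss (a signed class logit plus an L2 penalty), the helper names `logits`, `activation_map`, and `signature`, and all hyperparameters. The real method operates on deep networks, where the gradient would come from automatic differentiation rather than the closed form used here.

```python
import random


def logits(W, x):
    """Logits of a toy linear classifier: one dot product per class row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def activation_map(W, k, sign, lam=0.5, lr=0.1, steps=200, seed=0):
    """Gradient descent on  sign * logit_k(x) + lam * ||x||^2  over the input x.

    sign = -1.0 maximizes logit k, sign = +1.0 minimizes it (assumed loss form).
    For a linear model the gradient of logit_k with respect to x is simply W[k].
    """
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in W[k]]  # start from a random "image"
    for _ in range(steps):
        grad = [sign * w + 2.0 * lam * xi for w, xi in zip(W[k], x)]
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x


def signature(W, **kwargs):
    """Stack K minimization maps followed by K maximization maps (2K maps total)."""
    num_classes = len(W)
    mins = [activation_map(W, k, +1.0, **kwargs) for k in range(num_classes)]
    maxs = [activation_map(W, k, -1.0, **kwargs) for k in range(num_classes)]
    return mins + maxs
```

In this toy setting each map converges to a scaled copy of the class weight vector (negated for the minimization maps). A downstream binary classifier would then be trained on such signatures, collected from many benign and Trojan models.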

Paper Structure (25 sections, 6 equations, 11 figures, 2 tables)

Introduction
Related work
Activation maximization
Trojan attacks
Defenses against Trojan attacks
Datasets for Trojan attack defense
Approach
Threat model
Intuition
Framework
Activation optimization
Regularization
$L_2$ regularization
Total variation regularization
Feature extraction
...and 10 more sections

Figures (11)

Figure 1: Proposed framework for Trojan model detection. Given a $K$-class classifier, $M$ loss functions are optimized by adapting the input to the model. The resulting images constitute the signature for the model, which is used by a downstream classifier to tell if the model is Trojan or benign, after an optional feature extraction step.
Figure 2: An illustration of universal Trojan attacks. During training, a trigger is embedded in samples from all classes (victim classes) and the contaminated samples are all given one class label (the target class), which is class $K$ in this illustration. During testing, clean inputs are classified correctly. The inputs with the embedded trigger are classified as class $K$.
Figure 3: The activation optimization process. Starting from a random image, an activation optimization map is derived using gradient descent on a loss function based on the classification logits.
Figure 4: Sample signatures from the CIFAR10 Trojan dataset. Each signature has 20 images corresponding to the 10 classes of the dataset. The top two rows of each signature are for the activation minimization maps while the bottom two rows are for the activation maximization maps. Note how the trigger has a clear fingerprint in the minimization maps for the signature of the Trojan model.
Figure 5: Sample triggers from our dataset. Each trigger is $32\times 32$. Triggers are created by resizing $5\times 5$ random patches to $32\times 32$ using bicubic interpolation. Each Trojan model is trained with a unique trigger.
...and 6 more figures
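The optional feature extraction step mentioned in the Figure 1 caption can be pictured as collapsing the stack of signature images into per-pixel summary statistics, so that the feature size no longer depends on how many images the signature contains. The sketch below is a minimal illustration under stated assumptions: the choice of statistics (mean, standard deviation, minimum, maximum), the flattened-list image representation, and the function name `pixelwise_stats` are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev


def pixelwise_stats(images):
    """Reduce a list of equally sized, flattened images to per-pixel statistics.

    Returns one (mean, std, min, max) tuple per pixel position. The output
    length depends only on the image size, not on the number of images.
    The particular set of statistics is an assumption for illustration.
    """
    if not images:
        raise ValueError("at least one image is required")
    n_pixels = len(images[0])
    if any(len(img) != n_pixels for img in images):
        raise ValueError("all images must have the same number of pixels")

    features = []
    for p in range(n_pixels):
        values = [img[p] for img in images]  # pixel p across all images
        features.append((mean(values), pstdev(values), min(values), max(values)))
    return features
```

For example, three two-pixel images yield exactly two feature tuples, and so would thirty such images, which is what would let a single detector handle models with different numbers of classes.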
