Scanning Trojaned Models Using Out-of-Distribution Samples

Hossein Mirzaei, Ali Ansari, Bahar Dibaei Nia, Mojtaba Nafez, Moein Madadi, Sepehr Rezaee, Zeinab Sadat Taghavi, Arad Maleki, Kian Shamsaie, Mahdi Hajialilue, Jafar Habibi, Mohammad Sabokrou, Mohammad Hossein Rohban

TL;DR

This work tackles the problem of detecting Trojan (backdoor) attacks in DNNs without relying on attack-specific assumptions. It introduces TRODO, a universal Trojan scanning method that exploits blind spots by adversarially shifting OOD samples toward in-distribution regions and measuring the resulting change in MSP-based ID-Scores. The key idea is a data- and label-mapping-agnostic signature that remains effective even when training data is unavailable or the trojaned model is adversarially trained. Empirical results on eight backdoor attacks and TrojAI demonstrate strong accuracy and efficiency, with TRODO-Zero retaining substantial performance without any training data. Theoretical results link increased adversarial risk in near-OOD regions to backdoor strength, supporting TRODO’s robustness and providing insight into the vulnerabilities of trojaned classifiers under perturbations. Overall, TRODO offers a practical, scalable approach for Trojan scanning in diverse, data-limited real-world scenarios.

Abstract

Scanning for trojans (backdoors) in deep neural networks is crucial due to their significant real-world applications. There has been an increasing focus on developing general trojan scanning methods that are effective across various trojan attacks. Despite these advancements, there remains a shortage of methods that perform effectively without preconceived assumptions about the backdoor attack method. Additionally, we have observed that current methods struggle to identify classifiers trojaned using adversarial training. Motivated by these challenges, our study introduces a novel scanning method named TRODO (TROjan scanning by Detection of adversarial shifts in Out-of-distribution samples). TRODO leverages the concept of "blind spots": regions where trojaned classifiers erroneously identify out-of-distribution (OOD) samples as in-distribution (ID). We scan for these blind spots by adversarially shifting OOD samples towards in-distribution. The increased likelihood of perturbed OOD samples being classified as ID serves as a signature for trojan detection. TRODO is both trojan and label mapping agnostic, and is effective even against adversarially trained trojaned classifiers. It is applicable even in scenarios where training data is absent, and it demonstrates high accuracy and adaptability across various scenarios and datasets, highlighting its potential as a robust trojan scanning strategy.
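
The scanning idea described above (compute an MSP-based ID-Score for each OOD sample, adversarially perturb the sample to raise that score, and use the change as a signature) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration and not taken from the paper: the linear softmax classifier standing in for a DNN, the sign-gradient (PGD-style) attack with an L-infinity budget, the parameter values, and all function names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_logits(W, b, x):
    """Logits of a toy linear classifier (hypothetical stand-in for a DNN)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]


def id_score(W, b, x):
    """ID-Score taken as the maximum softmax probability (MSP)."""
    return max(softmax(linear_logits(W, b, x)))


def msp_gradient(W, b, x):
    """Gradient of the MSP w.r.t. x for a linear softmax model:
    p_k * (W_k - sum_j p_j W_j), where k is the predicted class."""
    p = softmax(linear_logits(W, b, x))
    k = max(range(len(p)), key=p.__getitem__)
    mean_row = [sum(p[j] * W[j][i] for j in range(len(W))) for i in range(len(x))]
    return [p[k] * (W[k][i] - mean_row[i]) for i in range(len(x))]


def pgd_increase_id_score(W, b, x, eps=0.1, step=0.02, n_steps=10):
    """Sign-gradient ascent on the MSP, projected onto an L-inf ball of radius eps.
    The attack type and budget are assumptions, not the paper's settings."""
    x_adv = list(x)
    for _ in range(n_steps):
        g = msp_gradient(W, b, x_adv)
        x_adv = [xa + step * ((gi > 0) - (gi < 0)) for xa, gi in zip(x_adv, g)]
        # Project back so that |x_adv - x| <= eps in every coordinate.
        x_adv = [min(max(xa, x0 - eps), x0 + eps) for xa, x0 in zip(x_adv, x)]
    return x_adv


def id_score_shift_signature(W, b, ood_samples, eps=0.1, step=0.02, n_steps=10):
    """Mean change in ID-Score over OOD samples after the attack."""
    deltas = []
    for x in ood_samples:
        x_adv = pgd_increase_id_score(W, b, x, eps, step, n_steps)
        deltas.append(id_score(W, b, x_adv) - id_score(W, b, x))
    return sum(deltas) / len(deltas)


# Toy usage: two classes, two input features, two "OOD" points.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
signature = id_score_shift_signature(W, b, [[0.2, 0.1], [0.0, 0.5]])
print(signature)
```

In a scanner built along these lines, a larger signature would point toward a trojaned classifier; how the decision threshold is chosen is not specified in this summary and is left open here.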

Paper Structure

This paper contains 41 sections, 2 theorems, 35 equations, 5 figures, 13 tables, 1 algorithm.

Key Result

Theorem 1

(Adversarial risk in near-OOD)

Figures (5)

  • Figure 1: An overview of TRODO. A) If a small portion of benign training samples is available, a module denoted G is used to obtain near-OOD samples. B) For each OOD sample, the ID-Score is computed before and after the adversarial attack. The difference between these scores is used as a signature to distinguish between a clean and a trojaned classifier. Performing the adversarial attack with a small budget helps to discriminate between benign and trojaned classifiers: 1) The lack of blind spots in the learned decision boundary of a clean model makes it difficult to increase the ID-Score of OOD samples, resulting in a small change in ID-Score. 2) For a trojaned model, $\Delta \text{ID-Score}$ is more discernible. This is due to the presence of blind spots, which make it easier to shift OOD samples inside the decision boundary.
  • Figure 2: The effect of using near-OOD samples. Given a trojaned classifier trained on CIFAR10, the blind spots in the learned decision boundary make it easier to increase the ID-Score of near-OOD samples (a fish is considered near-OOD for CIFAR10) than that of far-OOD samples (samples from MNIST are far-OOD for CIFAR10). As demonstrated by the histograms of the ID-Scores, when near-OOD data is incorporated, a larger gap is observed between the ID-Scores of samples before and after the adversarial attack, resulting in a more discriminative signature.
  • Figure 3: Model accuracy across different architectures and datasets. Trojaned models for all backdoor attacks show a consistent slight decrease in accuracy compared to clean models, suggesting benign overfitting in Trojaned classifiers.
  • Figure 4: The effect of overlaying triggers on OOD data in various attacks. As demonstrated, applying the trigger (which is used to poison the training data) to even far-OOD samples fools the model into identifying them as ID. This is due to benign overfitting on the trigger present in the training data.
  • Figure 5: Examples of ID samples and their corresponding crafted near-OOD samples. We used elastic transformations (Albumentations), random rotations, and CutPaste.
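
The near-OOD crafting mentioned in the Figure 5 caption can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's pipeline: images are plain 2D lists, rotation is restricted to 90 degrees (the caption says random rotations), the elastic transformation is omitted, and the function names `rotate90` and `cutpaste` are hypothetical.

```python
import random


def rotate90(img):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def cutpaste(img, patch_h, patch_w, rng):
    """CutPaste-style corruption: copy a random patch of the image and
    paste it at another random location, leaving the input unchanged."""
    h, w = len(img), len(img[0])
    out = [list(row) for row in img]
    # Source and destination top-left corners, chosen independently.
    sr, sc = rng.randrange(h - patch_h + 1), rng.randrange(w - patch_w + 1)
    dr, dc = rng.randrange(h - patch_h + 1), rng.randrange(w - patch_w + 1)
    for i in range(patch_h):
        for j in range(patch_w):
            out[dr + i][dc + j] = img[sr + i][sc + j]
    return out


# Toy usage on a 4x4 "image" whose pixel values are 0..15.
rng = random.Random(0)
image = [[4 * r + c for c in range(4)] for r in range(4)]
near_ood = cutpaste(rotate90(image), patch_h=2, patch_w=2, rng=rng)
```

Both operations keep the pixel statistics of the ID sample while disturbing its global structure, which is the sense in which the result is "near" the training distribution.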

Theorems & Definitions (6)

  • Theorem 1
  • Remark 1
  • Theorem 2
  • proof
  • proof
  • proof