TABOR: A Highly Accurate Approach to Inspecting and Restoring Trojan Backdoors in AI Systems

Wenbo Guo, Lun Wang, Xinyu Xing, Min Du, Dawn Song

TL;DR

This work tackles inspecting Trojan backdoors in AI systems under black-box constraints, where training data is unavailable. It reframes Trojan detection as a non-convex optimization task and adds explainable-AI–inspired regularizations plus a new trigger-quality metric to shrink the search space and suppress false alarms, improving trigger fidelity. Compared with Neural Cleanse, TABOR demonstrates higher accuracy in detection and trigger restoration across varied trojan configurations and models, showing robustness to trigger size, shape, and location. The approach enables practical security auditing and patching of deployed DNNs without access to training data or internal weights.

Abstract

A trojan backdoor is a hidden pattern typically implanted in a deep neural network. It is activated, forcing the infected model to behave abnormally, only when an input sample containing a particular trigger is fed to that model. As such, given a deep neural network model and clean input samples, it is very challenging to inspect and determine the existence of a trojan backdoor. Recently, researchers have designed and developed several pioneering solutions to address this acute problem, and have demonstrated that the proposed techniques have great potential in trojan detection. However, we show that none of these existing techniques completely addresses the problem. On the one hand, they mostly work under an unrealistic assumption (e.g., assuming availability of the contaminated training database). On the other hand, the proposed techniques can neither accurately detect the existence of trojan backdoors nor restore high-fidelity trojan backdoor images, especially when the triggers pertaining to the trojan vary in size, shape, and position. In this work, we propose TABOR, a new trojan detection technique. Conceptually, it formalizes a trojan detection task as a non-convex optimization problem, and the detection of a trojan backdoor as the task of resolving the optimization through an objective function. Unlike the existing technique that also models trojan detection as an optimization problem, TABOR designs a new objective function, guided by explainable AI techniques as well as heuristics, that steers the optimization toward identifying a trojan backdoor more effectively. In addition, TABOR defines a new metric to measure the quality of an identified trojan backdoor. Using an anomaly detection method, we show that the new metric better enables TABOR to identify intentionally injected triggers in an infected model and filter out false alarms......

Paper Structure

This paper contains 13 sections, 8 equations, 4 figures.

Figures (4)

  • Figure 1: Illustration of trigger insertion. The gray mark is the trigger, and $\mathbf{M}$ is the mask matrix whose elements equal 1 inside the region where the trigger is present and 0 everywhere else.
  • Figure 2: Illustration of observed false alarms and incorrect triggers.
  • Figure 3: Illustration of knocking off irrelevant features that are part of an identified trojan backdoor. The red box indicates the important features pinpointed through an explainable AI technique.
  • Figure 4: Demonstrations of the victim models that are trained on the ImageNet dataset and infected by triggers of different shapes, locations, and sizes.
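
The mask-based trigger insertion described in the Figure 1 caption can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation. It assumes the common blending form x' = (1 - M) * x + M * trigger, which is not spelled out in the text above. The function name `insert_trigger`, the 8x8 image size, and the trigger value 0.5 are hypothetical choices made for the example.

```python
import numpy as np


def insert_trigger(x, mask, trigger):
    """Stamp a trigger onto an image using a binary mask.

    Assumed blending rule (not taken from the text above):
        x' = (1 - M) * x + M * trigger
    where M is 1 inside the trigger region and 0 elsewhere.
    """
    x = np.asarray(x, dtype=np.float32)
    mask = np.asarray(mask, dtype=np.float32)
    trigger = np.asarray(trigger, dtype=np.float32)
    # Let an H x W mask broadcast over the channel axis of an H x W x C image.
    if mask.ndim == x.ndim - 1:
        mask = mask[..., None]
    return (1.0 - mask) * x + mask * trigger


# Hypothetical example: an all-black 8x8 RGB image with a gray 3x3 trigger
# stamped into the bottom-right corner.
image = np.zeros((8, 8, 3), dtype=np.float32)
mask = np.zeros((8, 8), dtype=np.float32)
mask[5:8, 5:8] = 1.0  # trigger region: mask equals 1
trigger = np.full((8, 8, 3), 0.5, dtype=np.float32)

stamped = insert_trigger(image, mask, trigger)
```

Pixels where the mask is 0 keep their original values, and pixels where it is 1 are replaced by the trigger. Under this assumed form, the size, shape, and location of a trigger are all determined by the mask alone.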