NeuronInspect: Detecting Backdoors in Neural Networks via Output Explanations

Xijie Huang, Moustafa Alzantot, Mani Srivastava

TL;DR

NeuronInspect detects trojan backdoors in DNNs by using output explanation heatmaps to identify attack targets, without needing backdoor samples or trigger restoration. It introduces three explanation-derived features (sparseness, smoothness, and persistence), combines them into a single score per output label, and flags anomalous labels via outlier analysis based on the median absolute deviation (MAD). Empirical results on MNIST and GTSRB show better robustness and substantially better efficiency than Neural Cleanse across varying trigger sizes, trigger locations, and even translucent triggers. The approach suggests that explainability signals are a practical basis for backdoor defense in real-world MLaaS deployments.

Abstract

Deep neural networks have achieved state-of-the-art performance on various tasks. However, their lack of interpretability and transparency makes it easier for malicious attackers to inject a trojan backdoor into a neural network, which makes the model behave abnormally when a backdoor sample with a specific trigger is input. In this paper, we propose NeuronInspect, a framework to detect trojan backdoors in deep neural networks via output explanation techniques. NeuronInspect first identifies the existence of backdoor attack targets by generating the explanation heatmap of the output layer. We observe that heatmaps generated from clean and backdoored models have different characteristics. We therefore extract features that measure three attributes of explanations from an attacked model: sparseness, smoothness, and persistence. We combine these features and use outlier detection to identify the outliers, which form the set of attack targets. We demonstrate the effectiveness and efficiency of NeuronInspect on the MNIST digit recognition dataset and the GTSRB traffic sign recognition dataset. We extensively evaluate NeuronInspect on different attack scenarios and show better robustness and effectiveness than the state-of-the-art trojan backdoor detection technique Neural Cleanse by a wide margin.
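
The final step described above, outlier detection over per-label scores, can be illustrated with a minimal MAD-based sketch. It is not the authors' implementation. The function name `mad_outlier_labels`, the two-sided test, the cutoff of 2.0, and the example scores are all assumptions made for illustration. The constant 1.4826 is the standard factor that makes the MAD a consistent estimate of the standard deviation for normally distributed data.

```python
import numpy as np

def mad_outlier_labels(scores, threshold=2.0):
    """Return indices of labels whose score is a MAD-based outlier.

    scores: one combined feature score per output label (hypothetical
        input; how the score is computed is not shown here).
    threshold: anomaly-index cutoff (assumed value, not from the paper).
    """
    scores = np.asarray(scores, dtype=float)
    median = np.median(scores)
    # Scaled median absolute deviation, a robust stand-in for the std.
    mad = 1.4826 * np.median(np.abs(scores - median))
    if mad == 0.0:
        # All scores (or most of them) are identical: nothing to flag.
        return []
    # Two-sided anomaly index: distance from the median in MAD units.
    anomaly_index = np.abs(scores - median) / mad
    return np.flatnonzero(anomaly_index > threshold).tolist()

# Made-up scores for ten labels; label 4 deviates strongly from the rest.
example_scores = [10.0, 10.4, 9.6, 10.2, 2.0, 9.8, 10.1, 9.9, 10.3, 9.7]
flagged = mad_outlier_labels(example_scores)
print(flagged)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme score, which is exactly what an attacked label would produce, barely shifts them.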

Paper Structure

This paper contains 30 sections, 10 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: The processing steps of NeuronInspect for backdoor detection. Explanation heatmaps are generated to explain the classifier output on different clean input images and different output labels. We observe that heatmaps generated from a trojaned network (shown in the third row) have distinguishing characteristics that we employ for backdoor detection.
  • Figure 2: Illustration of the trojan backdoor attack.
  • Figure 3: Sensitivity analysis on the size of the trigger.
  • Figure 4: Comparison of NeuronInspect and Neural Cleanse results on the GTSRB dataset with multiple triggers.
  • Figure 5: Original image without any trigger.
  • ...and 8 more figures