PBP: Post-training Backdoor Purification for Malware Classifiers

Dung Thuy Nguyen, Ngoc N. Tran, Taylor T. Johnson, Kevin Leach

TL;DR

PBP addresses the vulnerability of malware classifiers to backdoor poisoning by offering a post-training purification method that does not require knowledge of the embedding mechanism. It identifies backdoor neurons via activation-drift and BN-statistics alignment, then uses a two-phase approach—neuron masking and activation-shift fine-tuning—to erase backdoors using only a small clean data subset (as little as 1%). The method demonstrates strong empirical results, achieving near-zero attack success rates across multiple backdoor strategies, datasets, and attacker-power settings, outperforming existing fine-tuning baselines. This data-efficient, model-agnostic defense is particularly relevant for MLaaS and third-party pretrained malware detectors, with potential applicability to broader AI security contexts.

Abstract

In recent years, the rise of machine learning (ML) in cybersecurity has brought new challenges, including the increasing threat of backdoor poisoning attacks on ML malware classifiers. For instance, adversaries could inject malicious samples into public malware repositories, contaminating the training data and potentially causing the ML model to misclassify malware. Current countermeasures predominantly focus on detecting poisoned samples by leveraging disagreements within the outputs of a diverse set of ensemble models on training data points. However, these methods are not suitable for scenarios where Machine Learning-as-a-Service (MLaaS) is used or when users aim to remove backdoors from a model after it has been trained. Addressing this scenario, we introduce PBP, a post-training defense for malware classifiers that mitigates various types of backdoor embeddings without assuming any specific backdoor embedding mechanism. Our method exploits the influence of backdoor attacks on the activation distribution of neural networks, independent of the trigger-embedding method. In the presence of a backdoor attack, the activation distribution of each layer is distorted into a mixture of distributions. By regulating the statistics of the batch normalization layers, we can guide a backdoored model to perform similarly to a clean one. Our method demonstrates substantial advantages over several state-of-the-art methods, as evidenced by experiments on two datasets, two types of backdoor methods, and various attack configurations. Notably, our approach requires only a small portion of the training data (only 1%) to purify the backdoor and reduce the attack success rate from 100% to almost 0%, a 100-fold improvement over the baseline methods. Our code is available at https://github.com/judydnguyen/pbp-backdoor-purification-official.
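
To make the mixture-of-distributions idea concrete, the toy sketch below compares the batch statistics of a single neuron on clean inputs with those on a batch that also contains triggered inputs. It is an illustration only, not the authors' code: the helper `batch_stats` and all activation values are hypothetical.

```python
from statistics import fmean, pvariance


def batch_stats(activations):
    """Return (mean, variance) of one neuron's activations over a batch,
    the two quantities a batch-normalization layer tracks."""
    return fmean(activations), pvariance(activations)


# Hypothetical activations of a single neuron (illustrative values only).
clean = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2]
triggered = [2.9, 3.1, 3.0]  # assumed: a backdoor neuron firing strongly on the trigger

clean_mean, clean_var = batch_stats(clean)
mixed_mean, mixed_var = batch_stats(clean + triggered)

# The mixed batch has two modes, so both statistics drift away
# from the values measured on clean data alone.
mean_drift = abs(mixed_mean - clean_mean)
var_drift = abs(mixed_var - clean_var)

print(f"clean: mean={clean_mean:.3f} var={clean_var:.3f}")
print(f"mixed: mean={mixed_mean:.3f} var={mixed_var:.3f}")
```

In this toy setting, the gap between the clean and mixed statistics is the kind of signal that a defense based on batch-normalization statistics could use. How PBP actually uses it is described in the paper, not here.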

Paper Structure

This paper contains 45 sections, 2 theorems, 30 equations, 17 figures, 14 tables, 1 algorithm.

Key Result

Theorem 1

Let $\theta_0$ be the initial pretrained weights (i.e., line 13 in Algorithm 1). If the fine-tuning learning rate satisfies the theorem's condition (not reproduced in this summary), Algorithm 1 will converge.

Figures (17)

  • Figure 1: t-SNE representations of family-targeted and non-family-targeted backdoor attacks.
  • Figure 2: Model activation of backdoor neurons on targeted malware samples with and without a trigger.
  • Figure 3: Overview of the proposed method. PBP has two phases: (i) neuron mask generation and (ii) activation-shift fine-tuning. In the first phase, we initialize a noise model $f_{\hat{\theta}}$ and train it on clean data with an objective that aligns its neuron activations with those of the backdoored model $f_{\theta^0}$, using the Hessian trace during training to determine the neurons most important for this task. In the second phase, masked gradient optimization is applied by reversing the gradient sign of the masked neurons (in red). The fine-tuned model is expected not to classify triggered samples (i.e., malware) as "benign".
  • Figure 4: Final layer's activation of non-family-targeted backdoor attacks on triggered samples.
  • Figure 5: Final layer's activation of family-targeted backdoor attacks on triggered samples.
  • ...and 12 more figures
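
The second phase described in the Figure 3 caption can be sketched as a single masked update step: entries flagged by the mask receive a gradient with reversed sign, while the rest follow ordinary gradient descent. This is a minimal illustration under stated assumptions, not the authors' implementation. The function `masked_step`, the flat parameter list, the learning rate, and all numbers are hypothetical.

```python
def masked_step(params, grads, mask, lr=0.1):
    """One gradient-descent step in which masked entries use a reversed gradient sign.

    params, grads: lists of floats (a flattened toy parameter vector).
    mask: list of bools; True marks an entry flagged during mask generation.
    """
    updated = []
    for p, g, flagged in zip(params, grads, mask):
        signed_grad = -g if flagged else g    # reverse the sign for flagged entries
        updated.append(p - lr * signed_grad)  # plain descent on the signed gradient
    return updated


params = [1.0, 1.0, 1.0]
grads = [0.5, 0.5, -0.2]
mask = [False, True, False]  # hypothetical: only the second entry is flagged

new_params = masked_step(params, grads, mask)
print(new_params)
```

With identical gradients, the unflagged first entry moves down while the flagged second entry moves up, which is the intended effect of the sign reversal.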

Theorems & Definitions (9)

  • Definition 1: Clean Data Accuracy (C-Acc / CDA)
  • Definition 2: Attack Success Rate (ASR)
  • Definition 3: Defense Effectiveness Rating (DER) (Zhu et al., 2023)
  • Definition 4: Backdoor Neurons
  • Definition 5: Backdoor Sensitivity
  • Definition 6: Backdoor Purification
  • Theorem 1
  • Theorem 1 (restated)
  • Proof
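
The three evaluation metrics named in Definitions 1 to 3 can be computed as in the sketch below. The definitions themselves are not reproduced in this summary, so the code rests on assumptions: label `0` means benign and `1` means malware, the attack succeeds when a triggered malware sample is predicted benign, and DER takes the form commonly attributed to Zhu et al. (2023), $\mathrm{DER} = [\max(0, \Delta\mathrm{ASR}) - \max(0, \Delta\mathrm{Acc}) + 1] / 2$. All function names and sample predictions are hypothetical.

```python
def clean_accuracy(preds, labels):
    """Fraction of clean samples whose prediction matches the true label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def attack_success_rate(preds_on_triggered, target_label=0):
    """Fraction of triggered malware samples predicted as the attacker's target
    (assumed here to be label 0, i.e. benign)."""
    return sum(p == target_label for p in preds_on_triggered) / len(preds_on_triggered)


def defense_effectiveness_rating(acc_before, acc_after, asr_before, asr_after):
    """DER in [0, 1], assuming the form [max(0, dASR) - max(0, dAcc) + 1] / 2,
    where dASR and dAcc are the drops in ASR and clean accuracy after the defense."""
    delta_asr = max(0.0, asr_before - asr_after)
    delta_acc = max(0.0, acc_before - acc_after)
    return (delta_asr - delta_acc + 1) / 2


# Hypothetical predictions before and after purification.
labels = [0, 1, 1, 0]
acc_before = clean_accuracy([0, 1, 1, 0], labels)
acc_after = clean_accuracy([0, 1, 0, 0], labels)
asr_before = attack_success_rate([0, 0, 0, 0])  # every triggered sample passes as benign
asr_after = attack_success_rate([1, 1, 1, 1])   # none does after the defense

der = defense_effectiveness_rating(acc_before, acc_after, asr_before, asr_after)
print(acc_before, acc_after, asr_before, asr_after, der)
```

A defense that removes the backdoor without losing clean accuracy scores a DER of 1 under this form. In the example, the accuracy drop pulls the score below 1.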