On the Effectiveness of Adversarial Training on Malware Classifiers

Hamid Bostani, Jacopo Cortellazzi, Daniel Arp, Fabio Pierazzi, Veelasha Moonsamy, Lorenzo Cavallaro

TL;DR

This paper tackles a central question: how effective is adversarial training (AT) for malware classifiers operating in real-world, discrete feature spaces? It introduces Rubik, a unified, multi-dimensional evaluation framework that jointly analyzes data, feature representations, classifier types, and robust optimization settings, and applies it to Android malware using static representations such as DREBIN and RAMDA. Through systematic experiments across datasets, attacks (realistic and unrealistic), and domain constraints, the study shows that the benefits of AT depend on model architecture, feature-space structure, and the realism of the adversarial examples (AEs), challenging prior assumptions that realizable or high-confidence AEs yield universal gains. The findings lead to practical recommendations for balancing clean accuracy and robustness, underscore the importance of domain-aware evaluation, and stress that robust malware detectors require carefully aligned end-to-end configurations rather than one-size-fits-all defenses.

Abstract

Adversarial Training (AT) is a key defense against machine learning evasion attacks, but its effectiveness for real-world malware detection remains poorly understood. This uncertainty stems from a critical disconnect in prior research: studies often overlook the inherent nature of malware and are fragmented, examining variables such as the realism or confidence of adversarial examples in isolation, or relying on weak evaluations that yield non-generalizable insights. To address this, we introduce Rubik, a framework for the systematic, multi-dimensional evaluation of AT in the malware domain. The framework defines key factors across essential dimensions, including data, feature representations, classifiers, and robust optimization settings, enabling a comprehensive exploration of how the variables that influence AT interact, using reliable evaluation practices such as realistic evasion attacks. We instantiate Rubik on Android malware, empirically analyzing how this interplay shapes robustness. Our findings challenge prior beliefs (showing, for instance, that realizable adversarial examples offer only conditional robustness benefits) and reveal new insights, such as the critical role of model architecture and feature-space structure in determining AT's success. From this analysis, we distill four key insights, expose four common evaluation misconceptions, and offer practical recommendations to guide the development of truly robust malware classifiers.
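
As a rough illustration of what adversarial training looks like in a discrete feature space, the sketch below trains a logistic-regression detector on binary feature vectors while augmenting each step with evasive variants of the malware samples. It is not Rubik and not the authors' implementation: the linear model, the add-only constraint (features may be switched from 0 to 1 but never removed), the greedy attack, the perturbation bound `eps` counted as a number of added features, and every function name are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_step(w, b, X, y, lr):
    """One full-batch gradient step of logistic regression (label 1 = malware)."""
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    return w - lr * grad_w, b - lr * grad_b


def add_only_attack(x, w, eps):
    """Greedy evasion under an assumed add-only constraint.

    Switches on at most `eps` absent features, choosing those with the most
    negative weights, i.e. the ones that lower the malware score the most.
    """
    x_adv = x.copy()
    candidates = np.where((x_adv == 0) & (w < 0))[0]
    chosen = candidates[np.argsort(w[candidates])[:eps]]
    x_adv[chosen] = 1
    return x_adv


def adversarial_training(X, y, eps, epochs=200, lr=0.5):
    """Alternate between crafting AEs for malware and updating the model."""
    w = np.zeros(X.shape[1])
    b = 0.0
    malware_idx = np.where(y == 1)[0]
    for _ in range(epochs):
        # Inner step: craft one adversarial variant per malware sample
        # against the current model.
        X_adv = np.array([add_only_attack(X[i], w, eps) for i in malware_idx])
        # Outer step: update on clean data plus the adversarial malware.
        X_aug = np.vstack([X, X_adv]) if len(malware_idx) else X
        y_aug = np.concatenate([y, np.ones(len(malware_idx))])
        w, b = train_step(w, b, X_aug, y_aug, lr)
    return w, b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: 200 apps, 30 binary features; malware tends to set features 0-4.
    X = (rng.random((200, 30)) < 0.2).astype(float)
    y = (rng.random(200) < 0.5).astype(float)
    X[y == 1, :5] = (rng.random((int(y.sum()), 5)) < 0.8).astype(float)
    w, b = adversarial_training(X, y, eps=3)
    print("learned weights on first five features:", np.round(w[:5], 2))
```

For a linear model this greedy choice is the strongest add-only attack within the bound, which keeps the example small; non-linear classifiers and realizable, problem-space attacks of the kind the paper evaluates would need a different attack routine.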

Paper Structure

This paper contains 31 sections, 5 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: The impact of feature representation on data distribution and blind spot coverage, showing potentially (a) larger, more vulnerable regions and (b) smaller, less vulnerable ones.
  • Figure 2: Demonstrating the impact of classifiers on AT: (a, b) A linear classifier may forget earlier adjustments (at $t$) when facing new AEs (at $t+1$); (c, d) A non-linear classifier better adapts to new AEs.
  • Figure 3: Illustrating how different settings affect robust optimization: (a) larger perturbation bounds (e.g., $\epsilon_2$) and high-confidence AEs (e.g., $s_2$) can reveal more blind spots; (b) varying malware sets lead to different blind spot coverage, e.g., using $s_1$, $s_2$, and $s_3$ is more effective than using $s_1$ and $s_2$ alone; (c) targeting blind spots within the feasible space is sufficient, as only realizable AEs (e.g., $s_1$, $s_3$) represent practical threats.
  • Figure 4: Illustration of our unified framework proposed to investigate the influence of various key dimensions on the performance of malware classifiers.
  • Figure 5: Clean performance, measured as F1 score, of various models trained on (a) the DREBIN and (b) the RAMDA representations of the DREBIN20 dataset, and (c) the DREBIN and (d) the RAMDA representations of the APIGraph dataset. The models are strengthened using either PGD or JSMA with different perturbation bounds. The F1 scores of the vanilla models are shown at a perturbation bound of 0.
  • ...and 8 more figures