SLIFER: Investigating Performance and Robustness of Malware Detection Pipelines

Andrea Ponte, Dmitrijs Trizna, Luca Demetrio, Battista Biggio, Ivan Tesfai Ogbu, Fabio Roli

TL;DR

SLIFER addresses the production gap in malware detection by implementing a sequential pipeline that uses signature matching, static ML, and dynamic emulation, activating the costly dynamic stage only when needed. It systematically studies error handling for pre-processing failures, calibration of module thresholds, and robustness against adversarial EXEmples, reporting that treating undecodable samples as benign reduces false alarms and that static components dominate robustness. Empirically, SLIFER outperforms single-model baselines at low FPR and maintains near-zero overhead relative to static/dynamic baselines, with calibration offering modest adjustments. The work provides practical guidance for deploying sequential malware detectors in real environments and highlights the limited added value of dynamic analysis in the sequential setting.

Abstract

As a result of decades of research, Windows malware detection is approached through a plethora of techniques. However, there is an ongoing mismatch between academia, which pursues optimal performance in terms of detection rate and low false alarms, and the requirements of real-world scenarios. In particular, academia focuses on combining static and dynamic analysis within a single model or an ensemble of models, falling into several pitfalls such as (i) firing dynamic analysis without considering the computational burden it requires; (ii) discarding impossible-to-analyze samples; and (iii) analyzing robustness against adversarial attacks without considering that malware detectors are complemented with additional non-machine-learning components. Thus, in this paper we bridge these gaps by investigating the properties of malware detectors built with multiple and different types of analysis. To do so, we develop SLIFER, a Windows malware detection pipeline that sequentially leverages both static and dynamic analysis, interrupting computations as soon as one module triggers an alarm and requiring dynamic analysis only when needed. Contrary to the state of the art, we investigate how to deal with samples that impede analysis, showing how much they impact performance and concluding that it is better to flag them as legitimate so as not to drastically increase false alarms. Lastly, we perform a robustness evaluation of SLIFER. Counter-intuitively, the injection of new content is either blocked more by signatures than by dynamic analysis, due to byte artifacts created by the attack, or able to evade signatures, as they rely on file-size constraints that the attacks disrupt. As far as we know, we are the first to investigate the properties of sequential malware detectors, shedding light on their behavior in real production environments.
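
The sequential behavior described in the abstract (stop at the first alarm, forward samples a module cannot process, and treat samples that no module flags as legitimate) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Module` class, the `PreprocessingError` exception, the toy scoring functions, and the 0.5 thresholds are all hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class PreprocessingError(Exception):
    """Hypothetical error: a module could not parse or emulate the sample."""


@dataclass
class Module:
    name: str
    score: Callable[[bytes], float]  # hypothetical scorer returning a value in [0, 1]
    threshold: float                 # an alert is raised when score > threshold


def sequential_verdict(sample: bytes, modules: List[Module]) -> Optional[str]:
    """Return the name of the first module that raises an alert, or None.

    None means no module flagged the sample, so it is treated as legitimate.
    This includes samples that every fallible module failed to process.
    """
    for module in modules:
        try:
            score = module.score(sample)
        except PreprocessingError:
            continue  # forward the sample to the next module
        if score > module.threshold:
            return module.name  # early stop: later (costlier) modules never run
    return None


# Toy stand-ins for the three stages; real modules would be far more complex.
def toy_signatures(sample: bytes) -> float:
    return 1.0 if b"EVIL" in sample else 0.0


def toy_static_ml(sample: bytes) -> float:
    if not sample.startswith(b"MZ"):  # pretend the feature extractor needs a PE header
        raise PreprocessingError("not a parsable PE file")
    return 0.9 if b"packed" in sample else 0.1


def toy_dynamic(sample: bytes) -> float:
    return 0.8 if b"inject" in sample else 0.0


pipeline = [
    Module("signatures", toy_signatures, 0.5),
    Module("static_ml", toy_static_ml, 0.5),
    Module("dynamic", toy_dynamic, 0.5),
]
```

Ordering the modules from cheapest to most expensive is what keeps the costly dynamic stage off the common path: in this sketch `toy_dynamic` runs only if neither earlier stage has raised an alert.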

Paper Structure (17 sections, 4 figures, 13 tables)


Figures (4)

  • Figure 1: Graphical representation of the Windows PE format.
  • Figure 2: The architecture of SLIFER. Input traverses the pipeline by triggering each module sequentially, and computations are halted when one module raises an alert. In the presence of pre-processing errors, SLIFER forwards the input to the next module. This can happen in non-end-to-end modules such as GBDT-EMBER and Nebula.
  • Figure 3: ROC curves of Nebula and Neurlux, tested on $\mathcal{D}_2$. To compute these curves we discard impossible-to-analyze samples.
  • Figure 4: ROC curves of all the models under test. For each model, we take the decision threshold that fixes FPR at 6.7%, as scored in the validation process, to allow a fair comparison with SLIFER.
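
The fixed-FPR thresholding mentioned in the Figure 4 caption can be sketched as follows. This is a generic procedure under stated assumptions, not necessarily the one used in the paper: the function names are hypothetical, an alert is assumed to mean `score > threshold`, and the threshold is picked from benign validation scores only.

```python
import math
from typing import Sequence


def threshold_at_fpr(benign_scores: Sequence[float], target_fpr: float) -> float:
    """Pick a threshold whose false positive rate on benign_scores is <= target_fpr.

    Assumes an alert is raised when score > threshold (strict inequality).
    """
    ranked = sorted(benign_scores, reverse=True)
    # Number of benign samples we can afford to misclassify.
    allowed = math.floor(target_fpr * len(ranked))
    if allowed >= len(ranked):
        return -math.inf  # every benign sample may alert
    # Only scores strictly above ranked[allowed] alert; there are at most
    # `allowed` of them, even when scores are tied.
    return ranked[allowed]


def false_positive_rate(benign_scores: Sequence[float], threshold: float) -> float:
    """Fraction of benign samples that would raise an alert."""
    return sum(s > threshold for s in benign_scores) / len(benign_scores)


# Usage with synthetic validation scores (illustrative data, not from the paper).
validation_benign = [i / 1000 for i in range(1000)]
t = threshold_at_fpr(validation_benign, 0.067)
measured_fpr = false_positive_rate(validation_benign, t)
```

Fixing the same FPR for every model puts the comparison on detection rate alone, which is the quantity the ROC curves let the reader read off at that operating point.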