TESSERACT: Eliminating Experimental Bias in Malware Classification across Space and Time (Extended Version)

Zeliang Kan, Shae McFadden, Daniel Arp, Feargus Pendlebury, Roberto Jordaney, Johannes Kinder, Fabio Pierazzi, Lorenzo Cavallaro

TL;DR

This work identifies and formalizes spatial and temporal biases that inflate malware-detection performance in non-stationary environments. It introduces three constraints (C1–C3), the Area Under Time (AUT) metric, and a training-ratio tuning method to enable fair, time-aware evaluation across Android, Windows PE, and PDF domains. Through large-scale Android experiments and Windows/PDF case studies, the authors show that biases can significantly distort reported gains and that targeted tuning can improve robustness to time decay. They implement TESSERACT, an open-source framework that enforces the constraints, computes AUT, and supports bias-aware retraining strategies. The study advocates periodic retraining and careful evaluation to achieve stable, real-world performance in malware classification.

Abstract

Machine learning (ML) plays a pivotal role in detecting malicious software. Although numerous studies report F1-scores upwards of 0.99, the problem is not completely solved. Malware detectors often experience performance decay due to constantly evolving operating systems and attack methods, which can render previously learned knowledge insufficient for accurate decision-making on new inputs. This paper argues that commonly reported results are inflated due to two pervasive sources of experimental bias in the detection task: spatial bias, caused by data distributions that are not representative of a real-world deployment; and temporal bias, caused by incorrect time splits of data, leading to unrealistic configurations. To address these biases, we introduce a set of constraints for fair experiment design, and propose a new metric, AUT, for classifier robustness in real-world settings. We additionally propose an algorithm designed to tune training data to enhance classifier performance. Finally, we present TESSERACT, an open-source framework for realistic classifier comparison. Our evaluation encompasses both traditional ML and deep learning methods, examining published works on an extensive Android dataset with 259,230 samples over a five-year span. Additionally, we conduct case studies in the Windows PE and PDF domains. Our findings identify the existence of biases in previous studies and reveal that significant performance enhancements are possible through appropriate, periodic tuning. We explore how combining multiple mitigation strategies can delay performance decay and help achieve more stable and better performance over time.
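
The abstract names AUT without defining it. As a rough illustration, the sketch below assumes AUT is the trapezoidal-rule area under a metric-over-time curve, normalized so that it lies in $[0, 1]$: $\mathrm{AUT}(f, N) = \frac{1}{N-1}\sum_{k=1}^{N-1}\frac{f(x_{k+1}) + f(x_k)}{2}$, where $f(x_k)$ is the metric (for example $F_1$) in test period $k$. The function name `area_under_time` and the score lists are hypothetical and are not taken from the TESSERACT code base.

```python
from typing import Sequence


def area_under_time(scores: Sequence[float]) -> float:
    """Trapezoidal area under a per-period metric curve, normalized by N - 1.

    `scores` holds one metric value (e.g. F1) per test period, in time order.
    This is an illustrative sketch under the assumption stated above.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("AUT needs at least two test periods")
    # Average each pair of adjacent periods, then normalize by the number of pairs.
    total = sum((scores[k] + scores[k + 1]) / 2.0 for k in range(n - 1))
    return total / (n - 1)


# Hypothetical monthly F1 scores: one stable classifier, one that decays.
stable = [0.9, 0.9, 0.9, 0.9]
decaying = [0.9, 0.7, 0.5, 0.3]

print(f"stable AUT:   {area_under_time(stable):.2f}")
print(f"decaying AUT: {area_under_time(decaying):.2f}")
```

Both hypothetical classifiers start at the same $F_1$, yet the decaying one scores a lower AUT, which is the property that makes a time-aware summary more informative than a single aggregate score.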

Paper Structure (31 sections, 6 equations, 13 figures, 5 tables, 1 algorithm)

Figures (13)

  • Figure 1: Data distribution of the Android dataset used for this study. The figure shows a stacked histogram illustrating the monthly distribution of Android APKs we sourced from AndroZoo. It comprises $259,230$ Android applications, with approximately 10% being malware each month, covering a period from Jan. 2014 to Dec. 2018. The vertical dashed line indicates the split for all time-aware experiments in this study, with training data from 2014 and testing data from 2015 to 2018 unless otherwise stated.
  • Figure 2: Performance before and after feature selection. 'All features' stands for the approach's performance on the full extracted Drebin feature space, and 'Top-10k features' represents the performance with the 10,000 most important features, selected based on the classifier's weight vector.
  • Figure 3: Spatial experimental bias in testing. The models are trained on data from 2014 and tested on data from the remaining four years. In this unrealistic setting, where the percentage of malware in the test set is artificially increased, Precision for malware increases while Recall remains similar. Consequently, the overall $F_1$-Score also increases as the percentage of malware in the test set rises. However, a test set with more malware than goodware does not reflect the true in-the-wild distribution of 10% malware (see the paper's section on the malware ratio), so the setting is unrealistic and leads to biased results.
  • Figure 4: Motivating example for the intuition of spatial experimental bias in training with Linear-SVM and two features, $x_1$ and $x_2$. The training set changes, but the test points are fixed: 90% goodware (gw) and 10% malware (mw). When the percentage of malware in the training set increases, the decision boundary moves towards the goodware class, improving Recall for malware but decreasing Precision.
  • Figure 5: Spatial experimental bias in training. The models were trained on data from 2014 and tested on data from the remaining four years. As the percentage of malware in the training set increases, Precision decreases while Recall increases, in line with the intuition illustrated by the example in Figure 4. The paper's section on the tuning algorithm presents an algorithm to determine the optimal training configuration for optimizing Precision, Recall, or $F_1$-Score based on user requirements.
  • ...and 8 more figures
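
The effect described in the Figure 3 caption follows from simple arithmetic. The sketch below assumes a classifier whose true-positive rate and false-positive rate stay fixed while only the share of malware in the test set changes. The function name `metrics_at_ratio` and the rates 0.80 and 0.05 are hypothetical values chosen for illustration, not results from the paper.

```python
def metrics_at_ratio(tpr: float, fpr: float, malware_ratio: float):
    """Return (precision, recall, f1) for the malware class.

    `tpr` and `fpr` describe a fixed classifier; `malware_ratio` is the
    fraction of malware in the test set. All values are fractions in [0, 1].
    """
    tp = tpr * malware_ratio            # expected share of true positives
    fp = fpr * (1.0 - malware_ratio)    # expected share of false positives
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tpr                        # recall does not depend on the ratio
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Hypothetical classifier: detects 80% of malware, flags 5% of goodware.
TPR, FPR = 0.80, 0.05

for ratio in (0.10, 0.30, 0.50, 0.90):
    p, r, f1 = metrics_at_ratio(TPR, FPR, ratio)
    print(f"malware ratio {ratio:.0%}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With these assumed rates, precision rises from 0.64 at the realistic 10% malware ratio to roughly 0.94 at 50%, while recall stays at 0.80. The classifier itself is unchanged; only the test distribution makes it look better.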