Label-efficient Training Updates for Malware Detection over Time

Luca Minnei, Cristian Manca, Giorgio Piras, Angelo Sotgiu, Maura Pintor, Daniele Ghiani, Davide Maiorca, Giorgio Giacinto, Battista Biggio

Abstract

Machine Learning (ML)-based detectors are becoming essential to counter the proliferation of malware. However, common ML algorithms are not designed to cope with the dynamic nature of real-world settings, where both legitimate and malicious software evolve. This distribution drift causes models trained under static assumptions to degrade over time unless they are continuously updated. Regularly retraining these models, however, is expensive, since labeling newly acquired data requires costly manual analysis by security experts. To reduce labeling costs and address distribution drift in malware detection, prior work explored active learning (AL) and semi-supervised learning (SSL) techniques. Yet, existing studies (i) are tightly coupled to specific detector architectures and restricted to a specific malware domain, resulting in non-uniform comparisons; and (ii) lack a consistent methodology for analyzing the distribution drift, despite the critical sensitivity of the malware domain to temporal changes. In this work, we bridge this gap by proposing a model-agnostic framework that evaluates an extensive set of AL and SSL techniques, both in isolation and in combination, for Android and Windows malware detection. We show that these techniques, when combined, can reduce manual annotation costs by up to 90% across both domains while achieving comparable detection performance to full-labeling retraining. We also introduce a methodology for feature-level drift analysis that measures feature stability over time, showing its correlation with the detector performance. Overall, our study provides a detailed understanding of how AL and SSL behave under distribution drift and how they can be successfully combined, offering practical insights for the design of effective detectors over time.

Paper Structure

This paper contains 14 sections, 17 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of our proposed pipeline combining AL and SSL. The dash‑dot line represents the detector's decision boundary. Given a set of newly collected unlabeled samples (left), AL queries a small set of informative samples for expert annotation (center), while SSL automatically assigns pseudo-labels to high-confidence samples (right), enabling efficient model retraining with reduced labeling cost.
  • Figure 2: F1 score at FPR $=1\%$ over time for RF on ELSA (quarterly) and EMBER (monthly) with a $10\%$ labeling budget. Top: comparison of AL strategies, alongside NR and FL references. Bottom: comparison of SSL strategies under the same budget, with the same references.
  • Figure 3: F1 score at FPR $=1\%$ over time using RF on ELSA (quarterly) and EMBER (monthly), where each AL strategy is combined with its best-performing SSL strategy, including the optimal SSL budget, at a fixed labeling budget of $1\%$ (ELSA) and $2\%$ (EMBER). The lower panel reports the discriminant percentage of features over time, denoted by $\beta$, corresponding to the performance curves above.
  • Figure 4: Scatter plots of F1 score versus $\beta$ for the curves in Figure 3, for ELSA (right plot) and EMBER (left plot). The insets report Pearson $r$ and Kendall $\tau$ correlations with two-sided permutation-test p-values (10,000 resamples).
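The update pipeline in Figure 1 (AL queries the most informative samples for expert annotation, while SSL pseudo-labels high-confidence ones) can be sketched in a minimal form. This is an illustrative reconstruction, not the authors' implementation: it assumes a scikit-learn classifier, uses least-confidence sampling as the AL strategy and a fixed confidence threshold `tau` for pseudo-labeling, and the function name `update_round` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def update_round(model, X_new, budget, tau=0.95):
    """One label-efficient update round on newly collected samples X_new.

    AL step: select the `budget` least-confident samples for expert labeling.
    SSL step: pseudo-label the remaining samples whose predicted-class
    confidence is at least `tau`.

    Returns (query_idx, pseudo_idx, pseudo_labels): indices to send to the
    annotator, indices kept with pseudo-labels, and those pseudo-labels.
    """
    proba = model.predict_proba(X_new)
    conf = proba.max(axis=1)          # confidence of the predicted class

    # AL: least-confident samples are the most informative queries
    query_idx = np.argsort(conf)[:budget]

    # SSL: among the rest, keep only high-confidence samples
    rest = np.setdiff1d(np.arange(len(X_new)), query_idx)
    pseudo_idx = rest[conf[rest] >= tau]
    pseudo_labels = proba[pseudo_idx].argmax(axis=1)
    return query_idx, pseudo_idx, pseudo_labels


if __name__ == "__main__":
    # Synthetic demo: train an initial detector, then run one update round.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] > 0).astype(int)
    model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

    X_new = rng.normal(size=(100, 5))   # newly collected, unlabeled samples
    q_idx, p_idx, p_y = update_round(model, X_new, budget=10, tau=0.9)
    print(f"expert queries: {len(q_idx)}, pseudo-labeled: {len(p_idx)}")
```

The expert-labeled queries and the pseudo-labeled samples would then be appended to the training set for retraining, so only `budget` annotations are paid for per round instead of labeling the full batch.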