On the Reliability and Stability of Selective Methods in Malware Classification Tasks
Alexander Herzog, Aliai Eusebi, Lorenzo Cavallaro
TL;DR
The paper addresses the weakness of accuracy-centric evaluation for malware detectors operating under drift by introducing Aurora, a framework that evaluates confidence quality and operational resilience under distribution shift. It uses risk-coverage metrics (RC curves and the area under them, AURC), online abstention protocols, and temporal stability measures to quantify how well a model's confidence ranks its errors and how stable its behavior remains over time. Across three Android malware datasets, the study shows that strong offline accuracy does not guarantee reliable decision-making under rejection: simple baselines such as DeepDrebin often outperform more complex SOTA methods in reliability, calibration, and stability. The findings advocate for multi-dimensional, deployment-oriented evaluation, covering confidence ranking, calibration, and budget-aware rejection behavior, to guide method selection and data-start strategies (e.g., initial data subsampling) that improve practical robustness in evolving security contexts.
Abstract
The performance figures of modern drift-adaptive malware classifiers appear promising, but does this translate to genuine operational reliability? The standard evaluation paradigm primarily focuses on baseline performance metrics, neglecting confidence-error alignment and operational stability. While prior works established the importance of temporal evaluation and introduced selective classification in malware classification tasks, we take a complementary direction by investigating whether malware classifiers maintain reliable and stable confidence estimates under distribution shifts and exploring the tensions between scientific advancement and practical impacts when they do not. We propose Aurora, a framework to evaluate malware classifiers based on their confidence quality and operational resilience. Aurora subjects the confidence profile of a given model to verification to assess the reliability of its estimates. Unreliable confidence estimates erode operational trust, waste valuable annotation budgets on non-informative samples for active learning, and leave error-prone instances undetected in selective classification. Aurora is further complemented by a set of metrics designed to go beyond point-in-time performance, striving towards a more holistic assessment of operational stability throughout temporal evaluation periods. The fragility we observe in SOTA frameworks across datasets of varying drift severity suggests it may be time to revisit the underlying assumptions.
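To make the notion of confidence-error alignment concrete, the following is a minimal sketch of a risk-coverage (RC) curve and its area (AURC), the metrics named in the TL;DR. It assumes one common empirical estimator: samples are sorted by descending confidence, selective risk is the error rate among the samples kept at each coverage level, and AURC is the mean of those risks. The function names and the toy inputs are illustrative assumptions and are not taken from Aurora, whose exact definitions may differ.

```python
def risk_coverage_curve(confidence, correct):
    """Return (coverage, risk) lists for a selective classifier.

    confidence: per-sample confidence scores (higher = more certain).
    correct:    per-sample booleans, True if the prediction was right.
    Assumption: risk is the 0/1 error rate among the accepted samples.
    """
    if len(confidence) != len(correct):
        raise ValueError("confidence and correct must have equal length")
    if not confidence:
        raise ValueError("at least one sample is required")

    # Accept samples from most to least confident (stable on ties).
    order = sorted(range(len(confidence)), key=lambda i: -confidence[i])

    n = len(order)
    coverage, risk = [], []
    errors = 0
    for kept, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        coverage.append(kept / n)    # fraction of samples accepted
        risk.append(errors / kept)   # error rate among accepted samples
    return coverage, risk


def aurc(confidence, correct):
    """Area under the RC curve, estimated as the mean selective risk."""
    _, risk = risk_coverage_curve(confidence, correct)
    return sum(risk) / len(risk)


# Toy comparison: identical accuracy (50%), different confidence ranking.
scores = [0.9, 0.8, 0.7, 0.6]
well_ranked = aurc(scores, [True, True, False, False])    # errors get low confidence
poorly_ranked = aurc(scores, [False, False, True, True])  # errors get high confidence
```

In this toy comparison both models have the same accuracy, yet the one whose errors receive the lowest confidence obtains the lower AURC. This is the sense in which strong accuracy alone does not establish reliable behavior under rejection.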
