On the Reliability and Stability of Selective Methods in Malware Classification Tasks

Alexander Herzog, Aliai Eusebi, Lorenzo Cavallaro

TL;DR

The paper addresses the weakness of accuracy-centric evaluation for malware detectors operating under drift by introducing Aurora, a framework that evaluates confidence quality and operational resilience under distribution shifts. It uses risk-coverage metrics (RC curves and AURC), online abstention protocols, and temporal stability measures to quantify how well a model's confidence ranks its errors and how stable its behavior remains over time. Across three Android malware datasets, the study shows that strong offline accuracy does not guarantee reliable decision-making under rejection, with simple baselines like DeepDrebin often outperforming more complex SOTA methods in reliability, calibration, and stability. The findings advocate multi-dimensional, deployment-oriented evaluation that covers confidence ranking, calibration, and budget-aware rejection behavior, to guide method selection and initial-data strategies (e.g., subsampling the initial dataset) that improve practical robustness in evolving security contexts.
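To make the risk-coverage idea concrete, the following is a minimal sketch of how an RC curve and AURC could be computed from per-sample confidences and correctness flags. It assumes the common empirical definition (samples ranked from most to least confident, risk taken as the error rate among the accepted samples, AURC as the mean risk over all coverage levels); the function names `risk_coverage_curve` and `aurc` are illustrative and not taken from the paper, whose exact formulation may differ.

```python
import numpy as np


def risk_coverage_curve(confidence, correct):
    """Empirical risk-coverage curve (illustrative helper, not the paper's code).

    confidence: per-sample confidence scores (higher = more confident).
    correct:    per-sample flags, 1 if the prediction was right, else 0.
    Returns (coverage, risk), where risk[k-1] is the error rate among
    the k most confident samples.
    """
    confidence = np.asarray(confidence, dtype=float)
    errors = 1.0 - np.asarray(correct, dtype=float)

    # Rank samples from most to least confident (stable for ties).
    order = np.argsort(-confidence, kind="stable")
    cumulative_errors = np.cumsum(errors[order])

    n = errors.shape[0]
    accepted = np.arange(1, n + 1)
    coverage = accepted / n
    risk = cumulative_errors / accepted
    return coverage, risk


def aurc(confidence, correct):
    """Area under the RC curve, approximated as the mean risk over
    all coverage levels (assumed definition; lower is better)."""
    _, risk = risk_coverage_curve(confidence, correct)
    return float(risk.mean())
```

Under this definition, a model whose errors sit at the low-confidence end obtains a lower AURC than a model with the same accuracy whose errors receive high confidence, which is the property the confidence-ranking evaluation relies on.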

Abstract

The performance figures of modern drift-adaptive malware classifiers appear promising, but does this translate to genuine operational reliability? The standard evaluation paradigm primarily focuses on baseline performance metrics, neglecting confidence-error alignment and operational stability. While prior works established the importance of temporal evaluation and introduced selective classification in malware classification tasks, we take a complementary direction by investigating whether malware classifiers maintain reliable and stable confidence estimates under distribution shifts and exploring the tensions between scientific advancement and practical impacts when they do not. We propose Aurora, a framework to evaluate malware classifiers based on their confidence quality and operational resilience. Aurora subjects the confidence profile of a given model to verification to assess the reliability of its estimates. Unreliable confidence estimates erode operational trust, waste valuable annotation budgets on non-informative samples for active learning, and leave error-prone instances undetected in selective classification. Aurora is further complemented by a set of metrics designed to go beyond point-in-time performance, striving towards a more holistic assessment of operational stability throughout temporal evaluation periods. The fragility we observe in SOTA frameworks across datasets of varying drift severity suggests it may be time to revisit the underlying assumptions.

Paper Structure

This paper contains 64 sections, 11 equations, 6 figures, 1 table, 3 algorithms.

Figures (6)

  • Figure 1: CADE assumes that errors correlate with increased distance from cluster centers (case A). However, our evaluation with Aurora using AURC (§\ref{subsub:auroc}), which measures the error rate as a function of coverage when samples are ranked by confidence, reveals the opposite trend. Misclassifications often occur near cluster centers (case B), placing them in regions of high confidence. This contradicts the core premise of CADE's distance-based OOD scoring (see Figure \ref{fig:risk_coverage_tau50} and §\ref{sec:results}). Notably, this failure is not limited to the native CADE score but also extends to softmax-based uncertainty; both wrongly assign high confidence to erroneous predictions. Interestingly, on some datasets, a simple MLP trained on CADE embeddings, using softmax confidence as a proxy for uncertainty, outperforms CADE's native distance-based OOD metric. Softmax confidence declines near the decision boundary, aligning better with classifier uncertainty and offering a more reliable OOD indicator \cite{on_the_calibration_of_modern_neural_networks}.
  • Figure 2: Risk-coverage plots for selected datasets with a label budget of $B_{M_i}=50$. An ideal curve keeps the error minimal across the whole coverage spectrum; the error should rise only as coverage increases, that is, as more uncertain samples are accepted. Dashed lines denote models trained on a subsampled initial dataset $\mathcal{D}_0$ (with $B_0=4800$).
  • Figure 3: Temporal results for models trained on the androzoo dataset with $B_{M_i}=50$ and $\rho = 400$. The top row shows the F1 score after simulated selective classification, the middle row shows the actual rejections per month, and the bottom row shows the improvement in F1 after rejection relative to the no-rejection baseline ($\Delta$ F1). The bottom row highlights that, for some methods, rejection does not always improve the F1 score.
  • Figure 4: Continuous-learning performance of DeepDrebin (with cold-start) on the androzoo benchmark. Panel A shows the effect of four monthly label-budget settings ($B_{M_i}=50$, $B_{M_i}=100$, $B_{M_i}=200$, $B_{M_i}=400$) over increasing amounts of initially labeled data $B_{0}$. Panel B contrasts two subsampling heuristics applied to the initial pool $\mathcal{D}_{0}$ (StratK-Sampling and Uncertainty-Sampling), with results averaged across all label-budget settings to allow direct comparison. Horizontal dashed lines mark 10-percentage-point intervals; both panels share the same $F_{1}$-score range of 40 to 90% and use a logarithmic $x$-axis.
  • Figure 5: Results for Uncertainty-Sampling. Average performance across $n=5$ trials with DeepDrebin on selected datasets. For every $B_{M_i}$ (monthly label budget for $\mathcal{D}_{\text{test}}$) and $B_0$ (number of samples selected from $\mathcal{D}_0$), we run a full experiment on all months in $\mathcal{D}_{\text{test}}$ and report the average monthly performance, excluding the first 6 months of the test period as per standard protocol (CADE, HCC).
  • ...and 1 more figure