Transcending Transcend: Revisiting Malware Classification in the Presence of Concept Drift

Federico Barbero, Feargus Pendlebury, Fabio Pierazzi, Lorenzo Cavallaro

TL;DR

Malware classifiers suffer performance decay due to concept drift as attackers evolve. The authors formalize Transcend and introduce TRANSCENDENT, a practical rejection framework built on conformal evaluation, complemented by two novel evaluators (ICE and CCE) and an approximate-TCE variant. They also implement a random-search threshold calibration strategy to scale deployment across diverse models and domains, and demonstrate superior drift detection and generalization on a 5-year Android dataset, with extensions to Windows PE and PDF malware. The work provides actionable operational guidance and releases open-source tooling, enabling robust, drift-aware defenses in real-world security pipelines.

Abstract

Machine learning for malware classification shows encouraging results, but real deployments suffer from performance degradation as malware authors adapt their techniques to evade detection. This phenomenon, known as concept drift, occurs as new malware examples evolve and become less and less like the original training examples. One promising method to cope with concept drift is classification with rejection in which examples that are likely to be misclassified are instead quarantined until they can be expertly analyzed. We propose TRANSCENDENT, a rejection framework built on Transcend, a recently proposed strategy based on conformal prediction theory. In particular, we provide a formal treatment of Transcend, enabling us to refine conformal evaluation theory -- its underlying statistical engine -- and gain a better understanding of the theoretical reasons for its effectiveness. In the process, we develop two additional conformal evaluators that match or surpass the performance of the original while significantly decreasing the computational overhead. We evaluate TRANSCENDENT on a malware dataset spanning 5 years that removes sources of experimental bias present in the original evaluation. TRANSCENDENT outperforms state-of-the-art approaches while generalizing across different malware domains and classifiers. To further assist practitioners, we determine the optimal operational settings for a TRANSCENDENT deployment and show how it can be applied to many popular learning algorithms. These insights support both old and new empirical findings, making Transcend a sound and practical solution for the first time. To this end, we release TRANSCENDENT as open source, to aid the adoption of rejection strategies by the security community.

Paper Structure

This paper contains 52 sections, 5 equations, 13 figures, 5 tables, 4 algorithms.

Figures (13)

  • Figure 1: Possible NCMs for different classification algorithms: nearest centroid, support-vector machines (SVMs), nearest neighbors (NN), random forest, quadratic discriminant analysis (QDA), and multilayer perceptron (MLP). The solid line delineates the decision boundary between classes, while the dotted lines show SVM margins. The shaded region captures points which are more nonconform (i.e., 'less similar') than the new test point, shown by the asterisk, with respect to class . As NCMs, (a) uses the distance from the class centroid; (b) and (c) use the negated absolute distance from the hyperplane; (d) uses the proportion of nearest neighbors belonging to class ; (e) uses the proportion of decision trees that predict ; (f) uses the negated probability of belonging to class ; (g) uses the negated probability output by the final sigmoid activation layer; (h) uses the outputs of the final hidden layer to train an SVM with RBF kernel and uses the negated absolute probabilities output by that SVM; note that the decision boundary still depends on the MLP output alone.
  • Figure 2: The nested intervals at which labels $\bullet$ and $\circ$ are present in the output label set for a test example with per-class p-values $p_{\bullet}=0.32$ and $p_{\circ}=0.08$. Shaded areas outline how credibility and confidence relate to the intersection of prediction regions for which the label set contains a single element. The relatively high probability of the empty set containing the correct label (i.e., low credibility) indicates that one of conformal prediction's assumptions may have been violated. In conformal evaluation, this is used as a signal that the new example is likely out-of-distribution and is indicative of concept drift.
  • Figure 3: Illustration of the different calibration splits employed by each of the conformal evaluators showing the target of the p-value calculation, relative points included in the bag, and points excluded from the calibration.
  • Figure 4: Thresholding procedure applied to a linear SVM with approximate-TCE (3 folds). Four points highlighted with dotted outlines are left out as calibration in each fold, with the decision boundary obtained with the remaining points as training. P-values, shown above or below each calibration point, are calculated using the negated absolute distance from the decision boundary as an NCM. The shaded regions capture points which are more nonconform with respect to the predicted class (blue for class and red for class ). The alpha assessment (d) shows the distribution of p-values and per-class thresholds derived from Q1 of the correctly classified points (see the section on threshold search for a discussion of more complex search strategies for finding thresholds).
  • Figure 5: Test-time procedure applied to a linear SVM and calibrated Transcend (Jordaney et al., 2017), with distances from the hyperplane and corresponding nonconformity scores shown in (a). In (b) a new test point is classified as class . The p-value is calculated as the proportion of points belonging to with equal or greater nonconformity scores (captured by the shaded region) than the new point. In (c), the new point is compared against the threshold for class as derived during the calibration phase (Figure 4). As the p-value of the new point is greater than the threshold for the predicted class, the prediction is accepted.
  • ...and 8 more figures
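
The calibration and test-time steps described in the captions of Figures 2, 4, and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy distances, and the toy p-values are all hypothetical; the quartile interpolation method and the handling of ties at the threshold are choices made for this sketch; and the credibility and confidence helper uses the standard conformal-prediction definitions, which the captions do not spell out. Following the Figure 5 caption, the p-value here is a plain proportion over the calibration points; some conformal-prediction formulations also count the test point itself.

```python
from statistics import quantiles


def ncm(distance_from_hyperplane):
    """Nonconformity score: negated absolute distance from the hyperplane."""
    return -abs(distance_from_hyperplane)


def p_value(class_scores, test_score):
    """Proportion of calibration points of the predicted class whose
    nonconformity score is equal to or greater than the test point's."""
    as_strange = sum(1 for s in class_scores if s >= test_score)
    return as_strange / len(class_scores)


def q1_threshold(correct_p_values):
    """First quartile (Q1) of the p-values of correctly classified points.
    The 'inclusive' interpolation method is an assumption of this sketch."""
    return quantiles(correct_p_values, n=4, method="inclusive")[0]


def accept(p, threshold):
    """Accept when the p-value is greater than the class threshold.
    Rejecting ties is an assumption of this sketch."""
    return p > threshold


def credibility_and_confidence(p_values):
    """Standard conformal-prediction definitions (assumed here):
    credibility is the largest per-class p-value, and confidence is
    one minus the second-largest per-class p-value."""
    ranked = sorted(p_values.values(), reverse=True)
    return ranked[0], 1.0 - ranked[1]


# Hypothetical calibration points of one class (distances from the hyperplane).
calibration_scores = [ncm(d) for d in (0.5, 1.0, 1.5, 2.0)]

# Hypothetical p-values of correctly classified calibration points.
threshold = q1_threshold([0.2, 0.4, 0.6, 0.8, 1.0])

# One test point well inside the class, one very close to the boundary.
confident_p = p_value(calibration_scores, ncm(1.2))
drifting_p = p_value(calibration_scores, ncm(0.1))

print("accepted:", accept(confident_p, threshold))
print("accepted:", accept(drifting_p, threshold))
print(credibility_and_confidence({"filled": 0.32, "hollow": 0.08}))
```

In this toy setup, the point close to the decision boundary is less conforming than every calibration point of its predicted class, so its p-value falls below the Q1 threshold and the prediction would be quarantined rather than accepted.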