Enhancing Decision-Making in Windows PE Malware Classification During Dataset Shifts with Uncertainty Estimation

Rahul Yumlembam, Biju Issac, Seibu Mary Jacob

TL;DR

This work tackles the unreliability of Windows PE malware classifiers under dataset shifts by augmenting a LightGBM detector with neural networks, PriorNet, and neural-network ensembles, and by integrating ensemble-derived uncertainty into Inductive Conformal Evaluation (ICE) with a novel harmonic-threshold optimization. The authors calibrate probabilities, compute ensemble-based uncertainty metrics (Expected Entropy, Entropy of Expected, Knowledge Uncertainty), and evaluate decision rules via probability thresholds and ICE on EMBER, UCSB, and BODMAS datasets, including a severe shift scenario with packed malware. Key findings show that ensemble-based uncertainty within ICE significantly reduces incorrect predictions under shifts, outperforming state-of-the-art probability-based ICE, while calibration improves overall reliability; PriorNet is less robust to distributional changes due to reliance on out-of-distribution data. The results offer a practical pathway for uncertainty-aware, robust malware detection in security operations, enabling targeted human review and sandboxing for high-uncertainty cases while maintaining strong automation for high-confidence predictions.
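The three ensemble-based uncertainty metrics named above can be illustrated with a minimal sketch. It assumes the standard definitions from the ensemble-uncertainty literature: Entropy of Expected is the entropy of the averaged prediction, Expected Entropy is the average of the per-member entropies, and Knowledge Uncertainty is their difference. The summary does not spell out the paper's exact formulas, so the function name `ensemble_uncertainty`, the array layout, and the toy probabilities are all assumptions made for illustration.

```python
import numpy as np


def ensemble_uncertainty(member_probs, eps=1e-12):
    """Compute three uncertainty scores from ensemble class probabilities.

    member_probs: array of shape (M, N, C) holding the class probabilities
    of M ensemble members for N samples and C classes (assumed layout).
    Returns (expected_entropy, entropy_of_expected, knowledge_uncertainty),
    each of shape (N,), measured in nats.
    """
    p = np.asarray(member_probs, dtype=float)

    # Entropy of the ensemble-averaged prediction (total uncertainty).
    mean_p = p.mean(axis=0)
    entropy_of_expected = -(mean_p * np.log(mean_p + eps)).sum(axis=-1)

    # Average of each member's own entropy (data uncertainty).
    member_entropy = -(p * np.log(p + eps)).sum(axis=-1)
    expected_entropy = member_entropy.mean(axis=0)

    # Disagreement between members (knowledge uncertainty).
    knowledge_uncertainty = entropy_of_expected - expected_entropy
    return expected_entropy, entropy_of_expected, knowledge_uncertainty


# Toy example (hypothetical values): two members, two samples, two classes.
# Sample 0: the members agree. Sample 1: the members disagree strongly.
probs = np.array([
    [[0.9, 0.1], [0.9, 0.1]],  # member 1
    [[0.9, 0.1], [0.1, 0.9]],  # member 2
])
ee, eoe, ku = ensemble_uncertainty(probs)
print("Expected Entropy:     ", ee)
print("Entropy of Expected:  ", eoe)
print("Knowledge Uncertainty:", ku)
```

In this toy case Knowledge Uncertainty is near zero when the members agree and clearly positive when they disagree, which is the property that makes it a candidate signal for flagging shifted inputs.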

Abstract

Artificial intelligence techniques have achieved strong performance in classifying Windows Portable Executable (PE) malware, but their reliability often degrades under dataset shifts, leading to misclassifications with severe security consequences. To address this, we enhance an existing LightGBM (LGBM) malware detector by integrating Neural Networks (NN), PriorNet, and Neural Network Ensembles, evaluated across three benchmark datasets: EMBER, BODMAS, and UCSB. The UCSB dataset, composed mainly of packed malware, introduces a substantial distributional shift relative to EMBER and BODMAS, making it a challenging testbed for robustness. We study uncertainty-aware decision strategies, including probability thresholding, PriorNet, ensemble-derived estimates, and Inductive Conformal Evaluation (ICE). Our main contribution is the use of ensemble-based uncertainty estimates as Non-Conformity Measures within ICE, combined with a novel threshold optimisation method. On the UCSB dataset, where the shift is most severe, the state-of-the-art probability-based ICE (SOTA) yields an incorrect acceptance rate (IA%) of 22.8%. In contrast, our method reduces this to 16%, a relative reduction of about 30%, while maintaining competitive correct acceptance rates (CA%). These results demonstrate that integrating ensemble-based uncertainty with conformal prediction provides a more reliable safeguard against misclassifications under extreme dataset shifts, particularly in the presence of packed malware, thereby offering practical benefits for real-world security operations.
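As a quick check of the arithmetic behind the "about 30%" figure, the relative reduction follows directly from the two IA% values reported in the abstract (the symbol names below are introduced here only for illustration):

$$
\text{relative reduction} = \frac{\mathrm{IA}_{\text{SOTA}} - \mathrm{IA}_{\text{ours}}}{\mathrm{IA}_{\text{SOTA}}} = \frac{22.8 - 16}{22.8} = \frac{6.8}{22.8} \approx 0.298 \approx 30\%
$$

The absolute reduction is therefore 6.8 percentage points, and the relative reduction is roughly 30%.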

Paper Structure

This paper contains 20 sections, 29 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Correctly Accepted and Correctly Rejected trade-off graph using uncertainty estimates from the NN ensemble on the EMBER test dataset. Legend: UC: Uncalibrated, C: Calibrated, A: Correctly Accepted, R: Correctly Rejected, EE: Expected Entropy, EoE: Entropy of Expected, KU: Knowledge Uncertainty.
  • Figure 2: Performance assessment under severe out-of-distribution (OOD) shift, with models trained on EMBER evaluated on the UCSB dataset. The plot shows the trade-off between Correct Acceptance Rate (CA%) and Incorrect Acceptance Rate (IA%). Lower IA% corresponds to safer behaviour (fewer incorrect predictions accepted), while higher CA% indicates better retention of correct predictions. Labels are embedded inside the plot to identify each model. Label mapping: Prob-NN = Probability-Cal-LGBM+NN; Prob-PN = Probability-Cal-LGBM+PriorNet; Prob-ENS = Probability-Cal-LGBM+Ensemble; EE-PN = Expected Entropy-Cal-LGBM+PriorNet; EoE-PN = Entropy of Expected-Cal-LGBM+PriorNet; KU-PN = Knowledge Uncertainty-Cal-LGBM+PriorNet; EE-ENS = Expected Entropy-Cal-LGBM+NN Ensemble; EoE-ENS = Entropy of Expected-Cal-LGBM+NN Ensemble; KU-ENS = Knowledge Uncertainty-Cal-LGBM+NN Ensemble; Prob-NN+ICE = Cal-LGBM+NN+ICE+NCM:Probability (SOTA); Prob-PN+ICE = Cal-LGBM+PriorNet+ICE+NCM:Probability (SOTA); Prob-ENS+ICE = Cal-LGBM+NN-Ensemble+ICE+NCM:Probability (SOTA); EE-ENS+ICE(OW) = Cal-LGBM+NN-Ensemble+ICE+NCM:Expected Entropy (Our Work); EoE-ENS+ICE(OW) = Cal-LGBM+NN-Ensemble+ICE+NCM:Entropy of Expected (Our Work); KU-ENS+ICE(OW) = Cal-LGBM+NN-Ensemble+ICE+NCM:Knowledge Uncertainty (Our Work).
  • Figure 3: Correctly Accepted and Correctly Rejected trade-off graph using uncertainty estimates from the NN ensemble on the UCSB test dataset. Legend: UC: Uncalibrated, C: Calibrated, A: Correctly Accepted, R: Correctly Rejected, EE: Expected Entropy, EoE: Entropy of Expected, KU: Knowledge Uncertainty.
  • Figure 4: Correctly Accepted and Correctly Rejected trade-off graph using a range of probability thresholds on the EMBER test dataset. Legend: UC: Uncalibrated, C: Calibrated, A: Correctly Accepted, R: Correctly Rejected.
  • Figure 5: Correctly Accepted and Correctly Rejected trade-off graph using a range of probability thresholds on the UCSB test dataset. Legend: UC: Uncalibrated, C: Calibrated, A: Correctly Accepted, R: Correctly Rejected.
  • ...and 4 more figures
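
The acceptance trade-off that these figures plot can be sketched with a simple threshold rule: a prediction is accepted when its uncertainty score is at or below a threshold, and rejected (for example, sent to human review or a sandbox) otherwise. The sketch below is illustrative only. It assumes CA% and IA% are percentages of all test samples, which this summary does not confirm, and the function name `acceptance_rates` and the toy data are hypothetical.

```python
import numpy as np


def acceptance_rates(y_true, y_pred, uncertainty, threshold):
    """Return (CA%, IA%) for a threshold rule on an uncertainty score.

    A prediction is accepted when uncertainty <= threshold.
    Assumption: both rates are computed over ALL samples; the paper's
    exact denominators are not stated in this summary.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accepted = np.asarray(uncertainty) <= threshold
    correct = y_true == y_pred

    n = len(y_true)
    ca = 100.0 * np.sum(accepted & correct) / n    # correct and accepted
    ia = 100.0 * np.sum(accepted & ~correct) / n   # incorrect but accepted
    return ca, ia


# Toy data (hypothetical): four samples, one misclassified at index 2.
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
uncertainty = [0.1, 0.2, 0.6, 0.9]

strict = acceptance_rates(y_true, y_pred, uncertainty, threshold=0.5)
loose = acceptance_rates(y_true, y_pred, uncertainty, threshold=0.7)
print("threshold 0.5 -> CA%% = %.1f, IA%% = %.1f" % strict)
print("threshold 0.7 -> CA%% = %.1f, IA%% = %.1f" % loose)
```

In the toy data, the looser threshold admits the misclassified sample, which raises IA% while CA% stays the same. Sweeping the threshold traces out the kind of trade-off curve the figure captions describe.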