Predicting Vulnerability to Malware Using Machine Learning Models: A Study on Microsoft Windows Machines
Marzieh Esnaashari, Nima Moradi
TL;DR
The paper addresses the prediction of malware vulnerability for Microsoft Windows machines, using a large-scale dataset derived from Windows Defender and released for the Microsoft Malware Prediction Kaggle competition. It systematically compares several ML approaches (Gaussian Naive Bayes, Logistic Regression, Decision Trees, a Stacking ensemble, and the gradient-boosting models XGBoost and LightGBM), with feature engineering designed to handle high-cardinality categorical features. The gradient-boosting models, especially LightGBM, offer the strongest practical performance on this dataset, achieving the highest training accuracy and competitive test performance. Overall test accuracy nevertheless remains in the mid-60% range, due to the removal of missing values and to feature limitations. The study yields actionable insights for enterprise defense: it highlights which features drive predictions and illustrates the trade-offs between speed, memory usage, and predictive power. It also identifies opportunities for improvement through missing-value imputation and richer feature engineering on larger, more diverse datasets.
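To make the phrase "high-cardinality categorical features" concrete, the following minimal sketch shows frequency encoding, one common way to turn a column with many distinct values into a single numeric feature. This summary does not say which encoding the authors used, so the technique, the function names, and the sample values are all assumptions chosen for illustration.

```python
from collections import Counter


def fit_frequency_encoding(values):
    """Map each category to its relative frequency in the training column."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def apply_frequency_encoding(values, mapping, default=0.0):
    """Replace categories with learned frequencies; unseen ones get `default`."""
    return [mapping.get(value, default) for value in values]


# Hypothetical version strings, made up for illustration (not from the dataset).
train_versions = ["1.275.1", "1.275.1", "1.273.9", "1.275.1"]
mapping = fit_frequency_encoding(train_versions)

# The second value never appeared in training, so it falls back to the default.
encoded_test = apply_frequency_encoding(["1.275.1", "9.9.9"], mapping)
```

Fitting the mapping on training data only, and then applying it to held-out data, keeps information from the test set out of the encoded feature. The fallback for unseen categories is a design choice; a small nonzero value would be equally defensible.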
Abstract
In an era of escalating cyber threats, malware poses significant risks to individuals and organizations, potentially leading to data breaches, system failures, and substantial financial losses. This study addresses the urgent need for effective malware detection strategies by applying Machine Learning (ML) techniques to extensive datasets collected from Microsoft Windows Defender. Our research aims to develop an ML model that accurately predicts a machine's vulnerability to malware based on its specific conditions. Moving beyond traditional signature-based detection methods, we incorporate historical data and feature engineering to enhance detection capabilities. This study makes four contributions: first, it advances existing malware detection techniques by employing sophisticated ML algorithms; second, it uses a large-scale, real-world dataset to ensure the applicability of its findings; third, it highlights the importance of feature analysis in identifying key indicators of malware infection; and fourth, it proposes models that can be adapted for enterprise environments, offering a proactive approach to safeguarding extensive networks against emerging threats. We aim to improve cybersecurity resilience by providing practitioners with critical insights and by addressing the evolving challenges that malware poses in the digital landscape. The paper closes with a discussion of the results, insights, and conclusions.
