Optimized Deep Learning Models for Malware Detection under Concept Drift
William Maillet, Benjamin Marais
TL;DR
This work tackles concept drift in malware detection by proposing a model-agnostic protocol that combines a drift-aware loss (DRBCE), validation on the most recent data available, and feature reduction via Permutation Feature Importance. Using EMBER for training and more recent BODMAS and MalwareBazaar data for evaluation, the authors show that DRBCE, paired with a fresh validation set and a reduced feature set, substantially improves long-term detection under drift, detecting up to 15.2% more malware than the baseline on MalwareBazaar. The study gives practitioners concrete guidance: BCE excels at near-term detection, while DRBCE is more robust to evolving threats, and ensemble strategies may further improve drift resilience. Overall, the approach offers a practical, model-agnostic toolkit for maintaining malware-detection performance in rapidly changing threat landscapes.
Abstract
Despite the promising results of machine learning models in malicious file detection, they suffer from concept drift caused by the constant evolution of malicious files. Performance declines over time because the data distribution of new files differs from that of the training data, which requires frequent model updates. In this work, we propose a model-agnostic protocol to improve the resilience of a baseline neural network to drift. We show the importance of feature reduction and of training with the most recent validation set possible, and we propose a loss function named Drift-Resilient Binary Cross-Entropy, an improvement on the classical Binary Cross-Entropy that is more effective against drift. We train our model on the EMBER dataset, published in 2018, and evaluate it on a dataset of recent malicious files collected between 2020 and 2023. Our improved model shows promising results, detecting 15.2% more malware than a baseline model.
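
As an illustration of two ingredients of the protocol, the sketch below shows one possible way to hold out the most recent samples for validation and to rank features by permutation importance. It is not the authors' implementation: the function names, the accuracy-based score, the toy data, and the parameters (val_fraction, n_repeats, keep) are all assumptions made for this example, and the Drift-Resilient Binary Cross-Entropy loss is not reproduced because its definition is not given here.

```python
import numpy as np

def recent_validation_split(timestamps, val_fraction=0.1):
    """Split indices so that the newest samples form the validation set."""
    order = np.argsort(timestamps, kind="stable")  # oldest first
    n_val = max(1, int(round(len(order) * val_fraction)))
    return order[:-n_val], order[-n_val:]

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    base_acc = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling one column breaks its link with the labels.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(base_acc - np.mean(predict(X_perm) == y))
        importances[j] = np.mean(drops)
    return importances

def select_features(importances, keep):
    """Indices of the `keep` highest-scoring features, sorted by index."""
    top = np.argsort(importances)[::-1][:keep]
    return np.sort(top)

# Toy data (hypothetical): only feature 0 determines the label.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)
timestamps = np.arange(500)

# Stand-in for a trained model: a fixed rule that reads feature 0 only.
def predict(data):
    return (data[:, 0] > 0).astype(int)

train_idx, val_idx = recent_validation_split(timestamps, val_fraction=0.2)
importances = permutation_importance(predict, X[val_idx], y[val_idx])
kept = select_features(importances, keep=1)
```

In this toy setting the importance is computed on the most recent 20% of samples, and only the feature the stand-in model actually uses receives a non-zero score. With a real detector, predict would wrap the trained network and the score could be any validation metric.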
