Optimized Deep Learning Models for Malware Detection under Concept Drift

William Maillet, Benjamin Marais

TL;DR

This work tackles concept drift in malware detection by proposing a model-agnostic protocol that combines a drift-aware loss, Drift-Resilient Binary Cross-Entropy (DRBCE), validation on the most recent data available, and feature reduction via Permutation Feature Importance. Using EMBER for training and more recent BODMAS and MalwareBazaar data for evaluation, the authors show that DRBCE, paired with a recent validation set and a reduced feature set, improves long-term detection under drift, with up to 15.2% more MalwareBazaar malware detected than with a baseline model. For practitioners, the study reports that BCE excels at near-term detection while DRBCE is more robust to evolving threats, and it suggests that ensemble strategies could further improve drift resilience. The result is a practical, model-agnostic toolkit for maintaining malware-detection performance as the threat landscape changes.

Abstract

Despite the promising results of machine learning models in malicious file detection, these models suffer from concept drift because malware constantly evolves. Performance declines over time as the distribution of new files diverges from that of the training data, which forces frequent model updates. In this work, we propose a model-agnostic protocol to make a baseline neural network more robust to drift. We show the importance of feature reduction and of training with the most recent validation set possible, and we propose a loss function named Drift-Resilient Binary Cross-Entropy, an improvement on the classical Binary Cross-Entropy that is more effective against drift. We train our model on the EMBER dataset, published in 2018, and evaluate it on a dataset of recent malicious files collected between 2020 and 2023. Our improved model shows promising results, detecting 15.2% more malware than a baseline model.
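
Two ingredients named in the abstract, validating on the most recent data and reducing features by permutation importance, can be illustrated generically. The sketch below is not the authors' implementation: the function names, the accuracy-drop score, and the toy classifier are assumptions made for illustration, and the DRBCE loss is not reproduced because its definition is not given here.

```python
import numpy as np

def latest_validation_split(timestamps, val_fraction=0.2):
    """Return (train_idx, val_idx) with the most recent samples held out."""
    order = np.argsort(timestamps, kind="stable")  # oldest first
    n_val = max(1, int(round(val_fraction * len(order))))
    return order[:-n_val], order[-n_val:]

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean accuracy drop when each feature column is shuffled in turn.

    `predict` is any callable mapping a feature matrix to hard labels,
    so the procedure does not depend on the model type.
    """
    rng = np.random.default_rng(seed)
    base_acc = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling one column breaks its link with the label.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(base_acc - np.mean(predict(X_perm) == y))
        importances[j] = np.mean(drops)
    return importances

def select_top_features(importances, k):
    """Indices of the k most important features, in ascending index order."""
    return np.sort(np.argsort(importances)[::-1][:k])

# Toy usage: the label depends only on feature 0 (hypothetical data).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)
toy_predict = lambda X: (X[:, 0] > 0).astype(int)  # stand-in for a trained model

scores = permutation_importance(toy_predict, X_toy, y_toy)
kept = select_top_features(scores, k=1)
```

In a setting like the one described, the importances would be computed on the held-out recent split returned by `latest_validation_split`, and the model retrained on the columns in `kept`; how many features to keep is a tuning choice this sketch leaves open.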

Paper Structure

This paper contains 17 sections, 4 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: Different ways drift can occur [Lemaire2015]
  • Figure 2: BODMAS monthly malicious and benign files distribution
  • Figure 3: Monthly sample distribution in the MalwareBazaar dataset
  • Figure 4: Illustration of the chronological order for each subset
  • Figure 5: Architecture of the baseline model
  • ...and 2 more figures