Optimized Deep Learning Models for Malware Detection under Concept Drift

William Maillet; Benjamin Marais

TL;DR

This work tackles concept drift in malware detection with a model-agnostic protocol that combines a drift-aware loss (DRBCE), validation on the most recent data available, and feature reduction via Permutation Feature Importance. Using EMBER for training and recent BODMAS and MalwareBazaar data for evaluation, the authors show that DRBCE, paired with a fresh validation set and a reduced feature set, substantially improves long-term detection under drift, detecting up to 15.2% more MalwareBazaar malware than a baseline model. For practitioners, the study indicates that BCE performs best for near-term detection while DRBCE is more robust to evolving threats, and it suggests that ensemble strategies could further improve drift resilience.

Abstract

Despite the promising results of machine learning models in malicious file detection, these models face the problem of concept drift, because malicious files evolve constantly. Performance declines over time as the distribution of new files diverges from that of the training data, which requires frequent model updates. In this work, we propose a model-agnostic protocol to improve the robustness of a baseline neural network against drift. We show the importance of feature reduction and of training with the most recent validation set possible, and we propose a loss function named Drift-Resilient Binary Cross-Entropy, an improvement on the classical Binary Cross-Entropy that is more effective against drift. We train our model on the EMBER dataset, published in 2018, and evaluate it on a dataset of recent malicious files collected between 2020 and 2023. Our improved model shows promising results, detecting 15.2% more malware than a baseline model.
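
To make the feature-reduction step more concrete, here is a minimal sketch of Permutation Feature Importance, the technique named in the TL;DR. It is not the authors' implementation. The toy classifier, the accuracy metric, the number of repeats, and every function name are assumptions made for illustration; this summary does not give the paper's model, metric, or number of retained features.

```python
import random
from typing import Callable, List, Sequence

Row = List[float]


def accuracy(predict: Callable[[Row], int], X: Sequence[Row], y: Sequence[int]) -> float:
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(
    predict: Callable[[Row], int],
    X: Sequence[Row],
    y: Sequence[int],
    n_repeats: int = 5,
    seed: int = 0,
) -> List[float]:
    """Mean accuracy drop observed when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    base = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, which breaks its link with the labels.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(base - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def select_top_k(importances: Sequence[float], k: int) -> List[int]:
    """Indices of the k features with the largest importance, best first."""
    order = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)
    return order[:k]


# Hypothetical stand-in for a trained detector: it only looks at feature 0.
def toy_predict(row: Row) -> int:
    return int(row[0] > 0.5)


data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_predict(row) for row in X]

imp = permutation_importance(toy_predict, X, y)
top = select_top_k(imp, 1)
```

In this toy setup, shuffling features 1 and 2 leaves every prediction unchanged, so only feature 0 receives a positive importance and is kept. In a drift setting, one plausible choice (again an assumption) is to compute the importances on the most recent validation data, so that the retained features reflect what still discriminates on recent files.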

Paper Structure (17 sections, 4 equations, 7 figures, 7 tables, 1 algorithm)

Introduction
Definitions and Related Works
Drift Definition
Drift in Malware Context
Solutions to Handle Drift
Methodology
Data and Training Approach
Feature Selection and Reduction
Drift Resilient Binary Cross-Entropy
Models
Evaluation
Experiments and Results
DRBCE Loss and hyper-parameters tuning
Validation set
Feature reduction
...and 2 more sections

Figures (7)

Figure 1: Different ways drift occurs (Lemaire2015)
Figure 2: BODMAS monthly malicious and benign files distribution
Figure 3: Monthly sample distribution in the MalwareBazaar dataset
Figure 4: Illustration of the chronological order for each subset
Figure 5: Architecture of the baseline model
...and 2 more figures
