Leveraging VAE-Derived Latent Spaces for Enhanced Malware Detection with Machine Learning Classifiers

Bamidele Ajayi, Basel Barakat, Ken McGarry

TL;DR

Malware detection faces challenges from obfuscation and evolving variants. The authors propose a hybrid approach that uses Variational Autoencoder-derived latent representations as features for traditional classifiers to improve robustness and efficiency. They empirically demonstrate that ensemble classifiers, particularly LightGBM and Random Forest, achieve top accuracy and AUC across multiple datasets and data-split configurations while avoiding hyperparameter tuning. This work suggests that latent-space features can reduce computational costs and enable faster, scalable deployment in real-time cybersecurity contexts.

Abstract

This paper assesses the performance of five machine learning classifiers (Decision Tree, Naive Bayes, LightGBM, Logistic Regression, and Random Forest) trained on latent representations learned by a Variational Autoencoder from malware datasets. Experiments across different training-test splits and random seeds show that all models detect malware well, with the ensemble methods (LightGBM and Random Forest) performing slightly better than the rest. In addition, using latent features reduces computational cost and the need for extensive hyperparameter tuning, which makes the models more efficient to deploy. Statistical tests show that these improvements are significant, establishing the practical relevance of integrating latent-space representations with traditional classifiers for malware detection in cybersecurity.
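
The evaluation protocol described in the abstract (repeated training-test splits under several random seeds, applied to latent feature vectors) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the latent vectors are synthetic Gaussian blobs rather than real VAE encodings, the split ratios and seeds are hypothetical, and a nearest-centroid rule stands in for the paper's classifiers only so that the example needs nothing beyond the Python standard library.

```python
import random
import statistics

def make_latent_dataset(n_per_class, dim, seed):
    """Synthetic stand-in for VAE latent vectors: one Gaussian blob per class."""
    rng = random.Random(seed)
    data = []
    for label, centre in ((0, -1.0), (1, 1.0)):  # 0 = benign, 1 = malware (hypothetical)
        for _ in range(n_per_class):
            z = [rng.gauss(centre, 1.0) for _ in range(dim)]
            data.append((z, label))
    return data

def split(data, train_fraction, seed):
    """Shuffle with a fixed seed, then cut into train and test parts."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def fit_centroids(train):
    """Stand-in classifier: store the mean latent vector of each class."""
    sums, counts = {}, {}
    for z, y in train:
        acc = sums.setdefault(y, [0.0] * len(z))
        for i, value in enumerate(z):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, z):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(z, centroids[y])),
    )

def accuracy(centroids, test):
    hits = sum(predict(centroids, z) == y for z, y in test)
    return hits / len(test)

def evaluate(data, train_fractions, seeds):
    """Mean and standard deviation of test accuracy per split ratio."""
    results = {}
    for fraction in train_fractions:
        scores = []
        for seed in seeds:
            train, test = split(data, fraction, seed)
            scores.append(accuracy(fit_centroids(train), test))
        results[fraction] = (statistics.mean(scores), statistics.stdev(scores))
    return results

data = make_latent_dataset(n_per_class=200, dim=8, seed=0)
results = evaluate(data, train_fractions=(0.6, 0.7, 0.8), seeds=(1, 2, 3))
for fraction, (mean_acc, sd) in results.items():
    print(f"train={fraction:.0%} accuracy={mean_acc:.3f} +/- {sd:.3f}")
```

Reporting the mean and spread per split ratio, rather than a single run, is what lets differences between classifiers be checked for statistical significance. In a real experiment the synthetic data would be replaced by encoder outputs and the stand-in classifier by the models under comparison.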

Paper Structure

This paper contains 6 sections, 5 figures, 11 tables, 2 algorithms.

Figures (5)

  • Figure 1: Flow Diagram for Classifier with Latent Space Representations
  • Figure 2: EMBER Latent Feature Accuracy VS Training
  • Figure 3: EMBER Latent Feature AUC VS Training
  • Figure 4: BODMAS Latent Feature Accuracy VS Training
  • Figure 5: BODMAS Latent Feature AUC VS Training