MalDataGen: A Modular Framework for Synthetic Tabular Data Generation in Malware Detection

Kayua Oleques Paim, Angelo Gaspar Diniz Nogueira, Diego Kreutz, Weverton Cordeiro, Rodrigo Brandao Mansilha

TL;DR

The paper tackles data scarcity in malware detection by introducing MalDataGen, a modular open-source framework for synthetic tabular data generation using diverse deep generative models (e.g., WGAN-GP, VQ-VAE, Latent Diffusion Models) with Android-malware adaptations. It provides an Engine and Evaluation Resources for flexible data handling, model orchestration, and rigorous, reproducible evaluation across dual validation schemes ($TR$-$TS$ and $TS$-$TR$). Through experiments on the Androcrawl dataset across seven classifiers, MalDataGen often outperforms SDV in both utility (accuracy, precision, recall, F1, AUC) and fidelity (distance metrics), demonstrating the practical utility of composable synthetic data pipelines in cybersecurity. The framework’s modularity enables easy integration into detection pipelines and supports future expansion of models, metrics, and interoperability with other tools.

Abstract

The scarcity of high-quality data hinders malware detection and limits ML performance. We introduce MalDataGen, an open-source modular framework for generating high-fidelity synthetic tabular data with deep generative models (e.g., WGAN-GP, VQ-VAE). Evaluated via dual validation (TR-TS/TS-TR), seven classifiers, and utility metrics, MalDataGen outperforms benchmarks like SDV while preserving data utility. Its flexible design enables seamless integration into detection pipelines, offering a practical solution for cybersecurity applications.
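The dual validation mentioned above (TR-TS/TS-TR) can be illustrated with a small sketch. It assumes TR-TS means "train on real, test on synthetic" and TS-TR means "train on synthetic, test on real"; the summary does not spell the acronyms out. Everything in the block is hypothetical and not taken from MalDataGen: the toy Gaussian data, the nearest-centroid classifier standing in for the seven classifiers, and the helper names `make_data`, `fit_centroids`, `predict`, `accuracy`, and `dual_validation`.

```python
import random


def make_data(n, shift, seed):
    """Toy binary dataset: two 2-D Gaussian blobs (labels 0 and 1)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        center = (0.0 if label == 0 else 3.0) + shift
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y


def fit_centroids(X, y):
    """Nearest-centroid 'training': store the per-class feature means."""
    centroids = {}
    for label in set(y):
        rows = [x for x, l in zip(X, y) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, X):
    """Assign each sample to the class with the closest centroid."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda l: dist2(centroids[l], x)) for x in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def dual_validation(real, synth):
    """Score both directions: real->synthetic and synthetic->real."""
    X_r, y_r = real
    X_s, y_s = synth
    tr_ts = accuracy(y_s, predict(fit_centroids(X_r, y_r), X_s))
    ts_tr = accuracy(y_r, predict(fit_centroids(X_s, y_s), X_r))
    return {"TR-TS": tr_ts, "TS-TR": ts_tr}


real = make_data(200, shift=0.0, seed=0)
# Stand-in for generator output: same blobs with a small distribution shift.
synth = make_data(200, shift=0.1, seed=1)
scores = dual_validation(real, synth)
print(scores)
```

If the synthetic samples follow the real distribution closely, both scores should be high and close to each other; a large gap between the two directions hints that the generator misses or distorts part of the real data.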

Paper Structure

This paper contains 6 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of the MalDataGen composable framework.
  • Figure 2: Overview of the Generative Models architecture (with illustrative examples in parentheses).
  • Figure 3: Utility assessment: Binary classification metrics for SVM classifier performance using data generated by different models.
  • Figure 4: Evaluation of the adversarial model via SVM confusion matrices.
  • Figure 5: Comparative heat map.
  • ...and 1 more figure