MalDataGen: A Modular Framework for Synthetic Tabular Data Generation in Malware Detection
Kayua Oleques Paim, Angelo Gaspar Diniz Nogueira, Diego Kreutz, Weverton Cordeiro, Rodrigo Brandao Mansilha
TL;DR
The paper tackles data scarcity in malware detection by introducing MalDataGen, a modular open-source framework for synthetic tabular data generation using diverse deep generative models (e.g., WGAN-GP, VQ-VAE, Latent Diffusion Models) with Android-malware adaptations. It provides an Engine and Evaluation Resources for flexible data handling, model orchestration, and rigorous, reproducible evaluation across dual validation schemes ($TR$-$TS$ and $TS$-$TR$). Through experiments on the Androcrawl dataset across seven classifiers, MalDataGen often outperforms SDV in both utility (accuracy, precision, recall, F1, AUC) and fidelity (distance metrics), demonstrating the practical utility of composable synthetic data pipelines in cybersecurity. The framework’s modularity enables easy integration into detection pipelines and supports future expansion of models, metrics, and interoperability with other tools.
Abstract
The scarcity of high-quality data hinders malware detection and limits the performance of machine learning (ML) models. We introduce MalDataGen, an open-source modular framework for generating high-fidelity synthetic tabular data using deep generative models (e.g., WGAN-GP, VQ-VAE). Evaluated with dual validation (TR-TS/TS-TR), seven classifiers, and utility metrics, MalDataGen outperforms benchmarks such as SDV while preserving data utility. Its flexible design allows integration into detection pipelines, offering a practical solution for cybersecurity applications.
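As an illustration of the dual validation idea, the following minimal sketch assumes that TR-TS means "train on real, test on synthetic" and TS-TR means "train on synthetic, test on real"; the text above does not expand the acronyms, so this reading is an assumption. The toy data, the nearest-centroid classifier, and all function names are hypothetical stand-ins and are not part of MalDataGen or of the reported experiments.

```python
import random

def make_data(n, seed, n_features=20):
    """Toy binary feature vectors (hypothetical). Label 1 ("malware")
    activates each feature with probability 0.7, label 0 with 0.3."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        p = 0.7 if label else 0.3
        X.append([1 if rng.random() < p else 0 for _ in range(n_features)])
        y.append(label)
    return X, y

def fit_centroids(X, y):
    """Nearest-centroid classifier: store the per-class feature means."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
    )

def accuracy(centroids, X, y):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict(centroids, x) == t for x, t in zip(X, y)) / len(y)

# "Real" data, plus a second draw standing in for a generator's output.
real_X, real_y = make_data(400, seed=0)
synth_X, synth_y = make_data(400, seed=1)

# TR-TS (assumed): train on real, test on synthetic.
tr_ts = accuracy(fit_centroids(real_X, real_y), synth_X, synth_y)
# TS-TR (assumed): train on synthetic, test on real.
ts_tr = accuracy(fit_centroids(synth_X, synth_y), real_X, real_y)

print(f"TR-TS accuracy: {tr_ts:.3f}")
print(f"TS-TR accuracy: {ts_tr:.3f}")
```

Under this reading, the two directions are complementary: a high TS-TR score suggests that synthetic samples carry enough signal to train a detector that works on real data, while a high TR-TS score suggests that synthetic samples look plausible to a detector trained on real data.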
