Synthetic Data: AI's New Weapon Against Android Malware

Angelo Gaspar Diniz Nogueira, Kayua Oleques Paim, Hendrio Bragança, Rodrigo Brandão Mansilha, Diego Kreutz

TL;DR

Android malware detection suffers from scarce, quickly outdated data and from malware that evolves rapidly with the help of AI. The authors propose MalSynGen, a conditional GAN (cGAN) framework that generates synthetic tabular data which preserves the statistical properties of real data and improves classifier performance. They introduce fidelity and utility evaluations and apply the TSTR (Train on Synthetic, Test on Real) and TRTS (Train on Real, Test on Synthetic) protocols across six datasets, showing generalization and practical viability. The results demonstrate high utility (AUC ~0.80–0.99) and faithful data distributions on most datasets, with guidelines on computational trade-offs and future extensions.

Abstract

The ever-increasing number of Android devices and the accelerated evolution of malware, reaching over 35 million samples by 2024, highlight the critical importance of effective detection methods. Attackers are now using Artificial Intelligence to create sophisticated malware variations that can easily evade traditional detection techniques. Although machine learning has shown promise in malware classification, its success relies heavily on the availability of up-to-date, high-quality datasets. The scarcity and high cost of obtaining and labeling real malware samples present significant challenges in developing robust detection models. In this paper, we propose MalSynGen, a Malware Synthetic Data Generation methodology that uses a conditional Generative Adversarial Network (cGAN) to generate synthetic tabular data. This data preserves the statistical properties of real-world data and improves the performance of Android malware classifiers. We evaluated the effectiveness of this approach using various datasets and metrics that assess the fidelity of the generated data, its utility in classification, and the computational efficiency of the process. Our experiments demonstrate that MalSynGen can generalize across different datasets, providing a viable solution to address the issues of obsolescence and low-quality data in malware detection.
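
To make the utility evaluation concrete, the following is a minimal sketch of the two protocols named in the TL;DR: TSTR (Train on Synthetic, Test on Real) and TRTS (Train on Real, Test on Synthetic). It is not the authors' implementation. The toy feature rates, the naive Bayes classifier, the accuracy metric, and all function names are assumptions chosen for illustration, and the "generator" is a simple label-conditioned resampler standing in for a trained cGAN.

```python
import math
import random

# Hypothetical per-class feature rates for a toy binary dataset
# (label 0 = benign, 1 = malware); these values are NOT from the paper.
RATES = {0: [0.1, 0.2, 0.7, 0.5], 1: [0.8, 0.7, 0.2, 0.5]}

def sample(rates, n, seed):
    """Draw n (features, label) pairs, sampling each binary feature given the label."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2  # alternate labels to keep classes balanced
        rows.append(([int(rng.random() < p) for p in rates[label]], label))
    return rows

def fit(rows):
    """Estimate per-class feature rates with Laplace smoothing."""
    width = len(rows[0][0])
    model = {}
    for label in (0, 1):
        xs = [x for x, y in rows if y == label]
        model[label] = [(sum(x[j] for x in xs) + 1) / (len(xs) + 2) for j in range(width)]
    return model

def predict(model, x):
    """Bernoulli naive Bayes prediction with equal class priors."""
    def loglik(label):
        return sum(math.log(p if v else 1 - p) for v, p in zip(x, model[label]))
    return max((0, 1), key=loglik)

def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(model, x) == y for x, y in rows) / len(rows)

real_train = sample(RATES, 600, seed=1)
real_test = sample(RATES, 400, seed=2)

# Stand-in "generator": resample from rates fitted on the real training data.
# MalSynGen would draw these rows from a trained cGAN instead.
synthetic = sample(fit(real_train), 600, seed=3)

tstr = accuracy(fit(synthetic), real_test)   # Train on Synthetic, Test on Real
trts = accuracy(fit(real_train), synthetic)  # Train on Real, Test on Synthetic
print(f"TSTR accuracy: {tstr:.3f}  TRTS accuracy: {trts:.3f}")
```

If the synthetic data is useful, TSTR should land close to the score of a classifier trained on real data, while TRTS indicates whether synthetic samples look plausible to a model that has only seen real ones. The sketch reports accuracy for brevity; the TL;DR reports AUC.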

Paper Structure

This paper contains 17 sections, 1 equation, 18 figures, 8 tables.

Figures (18)

  • Figure 1: Selection, training, and evaluation of cGANs using MalSynGen. Solid arrows denote the sequential order of processes, while dotted arrows indicate the flow of artifact creation or utilization (datasets, models, raw data, and performance metrics). Certain artifacts are depicted multiple times to streamline the visualization of arrow intersections.
  • Figure 2: MalSynGen is based on the architecture of Conditional Generative Adversarial Networks (cGANs).
  • Figure 3: Generator of the cGAN architecture used in MalSynGen.
  • Figure 4: Discriminator of the cGAN architecture used in MalSynGen.
  • Figure 5: Clustering of malware samples within datasets.
  • ...and 13 more figures