Reducing Instability in Synthetic Data Evaluation with a Super-Metric in MalDataGen

Anna Luiza Gomes da Silva, Diego Kreutz, Angelo Diniz, Rodrigo Mansilha, Celso Nobre da Fonseca

TL;DR

Synthetic data evaluation in the Android malware domain suffers from unstable, non-standardized fidelity metrics. The paper introduces a Super-Metric that aggregates eight fidelity metrics across four dimensions into a single weighted score and integrates it into the MalDataGen framework to enable unified benchmarking. Empirical results across ten generative models and five datasets show that the Super-Metric is more stable than traditional metrics and aligns better with classifier recall and F1-score, which improves reproducibility and interpretability. The approach provides a robust, adaptable benchmark for synthetic tabular data quality, with potential extension to other domains and deployment pipelines.

Abstract

Evaluating the quality of synthetic data remains a persistent challenge in the Android malware domain due to instability and the lack of standardization among existing metrics. This work integrates a Super-Metric into MalDataGen; it aggregates eight metrics across four fidelity dimensions into a single weighted score. Experiments involving ten generative models and five balanced datasets demonstrate that the Super-Metric is more stable and consistent than traditional metrics and correlates more strongly with the actual performance of classifiers.
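
To make the aggregation idea concrete, the following is a minimal sketch of a weighted score built from eight metrics grouped into four dimensions. It is not the MalDataGen implementation: the dimension names, metric names, equal weights, and the assumption that every metric is already normalized to [0, 1] with higher meaning better are all placeholders chosen for illustration, since this summary does not specify them.

```python
from statistics import mean

# Hypothetical grouping: four fidelity dimensions with two metrics each
# (eight metrics total). Names are placeholders, not the paper's metrics.
DIMENSIONS = {
    "dimension_a": ["metric_1", "metric_2"],
    "dimension_b": ["metric_3", "metric_4"],
    "dimension_c": ["metric_5", "metric_6"],
    "dimension_d": ["metric_7", "metric_8"],
}

# Hypothetical equal weights; the actual weighting scheme is an assumption.
WEIGHTS = {name: 0.25 for name in DIMENSIONS}


def super_metric(scores, dimensions=DIMENSIONS, weights=WEIGHTS):
    """Return the weighted mean of per-dimension average scores.

    `scores` maps metric name -> value, assumed normalized to [0, 1]
    with higher values meaning better fidelity.
    """
    total_weight = sum(weights[dim] for dim in dimensions)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    aggregate = 0.0
    for dim, metric_names in dimensions.items():
        # Average the metrics inside one dimension, then apply its weight.
        dim_score = mean(scores[name] for name in metric_names)
        aggregate += weights[dim] * dim_score

    # Dividing by the total keeps the result in [0, 1] even if the
    # weights do not sum to exactly one.
    return aggregate / total_weight
```

Averaging within each dimension before weighting keeps a dimension with more metrics from dominating the score; other aggregation choices are equally plausible under the description given here.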

Paper Structure

This paper contains 5 sections and 3 figures.

Figures (3)

  • Figure 1: Workflow of the synthetic data generation and evaluation methodology in MalDataGen.
  • Figure 2: Heatmap – Average correlation between fidelity metrics and utility metrics (recall and F1-score) per generative model.
  • Figure 3: Boxplot – Distribution of the correlation between fidelity metrics and utility metrics (recall and F1-score) per generative model.