Exploring the Landscape for Generative Sequence Models for Specialized Data Synthesis

Mohammad Zbeeb, Mohammad Ghorayeb, Mariam Salman

TL;DR

This work tackles synthetic data generation for highly structured, privacy-sensitive network-traffic data. It reframes numeric flows as symbolic text and compares three sequence-model families (WaveNet, RNN, and Transformer), each trained to predict the next symbol. The RNN achieves the highest alignment with real data (about 87.9% inliers), followed by the Transformer (about 84.9%) and WaveNet (about 69.2%), which suggests that sequence models are effective for structured data. The study contributes a privacy-aware evaluation framework, offers open-source code and models, and discusses broader implications and applications across domains.

Abstract

Artificial Intelligence (AI) research often aims to develop models that can generalize reliably across complex datasets, yet this remains challenging in fields where data is scarce, intricate, or inaccessible. This paper introduces a novel approach that leverages three generative models of varying complexity to synthesize one of the most demanding structured datasets: Malicious Network Traffic. Our approach uniquely transforms numerical data into text, re-framing data generation as a language modeling task, which not only enhances data regularization but also significantly improves generalization and the quality of the synthetic data. Extensive statistical analyses demonstrate that our method surpasses state-of-the-art generative models in producing high-fidelity synthetic data. Additionally, we conduct a comprehensive study on synthetic data applications, effectiveness, and evaluation strategies, offering valuable insights into its role across various domains. Our code and pre-trained models are openly accessible on GitHub, enabling further exploration and application of our methodology.

Index Terms: Data synthesis, machine learning, traffic generation, privacy-preserving data, generative models.
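The idea of turning numeric flows into text and training on next-symbol prediction can be illustrated with a minimal sketch. This is not the authors' encoding: the uniform binning, the 26-letter alphabet, the value ranges, and the helper names `encode_value`, `encode_flow`, and `next_symbol_pairs` are all assumptions chosen only to make the reframing concrete.

```python
import string

# Hypothetical alphabet: one symbol per quantization bin (an assumption,
# not the encoding used in the paper).
ALPHABET = string.ascii_lowercase  # 26 bins


def encode_value(x, lo, hi, alphabet=ALPHABET):
    """Map a numeric value to one symbol by uniform binning over [lo, hi]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clipped = min(max(x, lo), hi)  # clamp out-of-range values
    idx = int((clipped - lo) / (hi - lo) * len(alphabet))
    return alphabet[min(idx, len(alphabet) - 1)]  # hi falls in the last bin


def encode_flow(flow, ranges):
    """Encode one numeric record (a list of features) as a string of symbols."""
    return "".join(encode_value(x, lo, hi) for x, (lo, hi) in zip(flow, ranges))


def next_symbol_pairs(text, context=3):
    """Build (context, next symbol) training pairs for a language model."""
    return [(text[i:i + context], text[i + context])
            for i in range(len(text) - context)]


# Illustrative values only: three features, each assumed to lie in [0, 100].
encoded = encode_flow([0, 50, 100], [(0, 100)] * 3)
pairs = next_symbol_pairs("abcde", context=3)
```

Any of the three model families compared here could, in principle, be trained on pairs of this shape; decoding generated symbols back to numbers would require the inverse of whatever binning is chosen.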

Paper Structure

This paper contains 29 sections, 13 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: PCA Explained Variance Plot: The majority of the variance is captured by the first few principal components, indicating that much of the data's complexity can be explained by a small subset of components.
  • Figure 2: Comparison of Classification and Regression Manifolds. The left plot represents the classification problem with a decision boundary, while the right plot shows the regression problem with a fitted regression line.
  • Figure 3: Architectures Used in Our Study
  • Figure 4: A Recurrent Neural Network
  • Figure 5: Transformer Architecture
  • ...and 2 more figures