NetSSM: Multi-Flow and State-Aware Network Trace Generation using State-Space Models

Andrew Chu, Xi Jiang, Shinan Liu, Arjun Bhagoji, Francesco Bronzino, Paul Schmitt, Nick Feamster

TL;DR

The paper tackles the scarcity of realistic network traces by introducing NetSSM, a state-space-model-based raw packet generator that handles multi-flow sessions and flow-state dynamics. Built on the Mamba-2 backbone, NetSSM learns from and produces PCAP traces 8x and 78x longer than existing transformer-based approaches, and the paper reports higher statistical fidelity, downstream utility, and semantic similarity while maintaining protocol compliance. Its pipeline tokenizes raw packet bytes, trains with self-supervised next-token prediction on long contexts, and generates traces with workload-specific labeling; timestamps are not learned but sampled post hoc. The results show strong performance across multiple multimedia workloads, and the tool is open source, so it can supply high-fidelity synthetic network data for security, modeling, and performance evaluation.
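The tokenize-then-predict pipeline summarized above can be illustrated with a minimal sketch. It is not NetSSM's actual tokenizer: the vocabulary layout, the `PKT_SEP` and `LABEL_BASE` ids, and all function names are hypothetical and chosen only to show how raw packet bytes and a workload label might be flattened into one sequence for next-token prediction.

```python
from typing import List, Tuple

# Hypothetical vocabulary layout (an assumption, not taken from the paper):
# ids 0..255 are raw byte values, followed by special tokens.
PKT_SEP = 256     # marks the end of one packet
LABEL_BASE = 257  # first id reserved for workload labels


def tokenize_session(packets: List[bytes], label_id: int) -> List[int]:
    """Flatten raw packets into one token sequence, led by a label token."""
    tokens = [LABEL_BASE + label_id]
    for pkt in packets:
        tokens.extend(pkt)      # iterating over bytes yields ints in 0..255
        tokens.append(PKT_SEP)  # packet boundary
    return tokens


def detokenize_session(tokens: List[int]) -> List[bytes]:
    """Invert tokenize_session; bytes after the last separator are dropped."""
    packets, current = [], []
    for tok in tokens[1:]:      # skip the leading label token
        if tok == PKT_SEP:
            packets.append(bytes(current))
            current = []
        else:
            current.append(tok)
    return packets


def next_token_pairs(tokens: List[int]) -> Tuple[List[int], List[int]]:
    """Build (input, target) sequences for next-token prediction."""
    return tokens[:-1], tokens[1:]


packets = [b"\x45\x00", b"\x45\x01\xff"]
tokens = tokenize_session(packets, label_id=2)
inputs, targets = next_token_pairs(tokens)
```

In this sketch, timestamps are deliberately absent from the token stream, which mirrors the summary's statement that they are sampled after generation rather than learned.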

Abstract

Access to raw network traffic data is essential for many computer networking tasks, from traffic modeling to performance evaluation. Unfortunately, this data is scarce due to high collection costs and governance rules. Previous efforts explore this challenge by generating synthetic network data, but fail to reliably handle multi-flow sessions, struggle to reason about stateful communication in moderate to long-duration network sessions, and lack robust evaluations tied to real-world utility. We propose a new method based on state-space models called NetSSM that generates raw network traffic at the packet-level granularity. Our approach captures interactions between multiple, interleaved flows -- an objective unexplored in prior work -- and effectively reasons about flow-state in sessions to capture traffic characteristics. NetSSM accomplishes this by learning from and producing traces 8x and 78x longer than existing transformer-based approaches. Evaluation results show that our method generates high-fidelity traces that outperform prior efforts in existing benchmarks. We also find that NetSSM's traces have high semantic similarity to real network data regarding compliance with standard protocol requirements and flow and session-level traffic characteristics.

Paper Structure

This paper contains 33 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Overview of the NetSSM pipeline.
  • Figure 1: Comparative ML performance across different model choices with mixed training data proportions.
  • Figure 2: Performance of random forest classifiers trained on mixed real/synthetic data. Models trained on NetSSM data perform best across baselines. Shading denotes the delta between the next best baseline.
  • Figure 2: Distributions for downloaded segments. KDE (log-transformed) and ECDF (non-log-transformed, displayed on log scale) plots for the number and size of downloaded segments sent per sender. The ground truth trace has a data bit rate of $1{,}366$ kbps. NetSSM's distributions overlap significantly with the real data.
  • Figure 3: Comparison of throughput (synthetic vs. corresponding ground truth trace). Each point's color/shape combination denotes a unique flow. Color/shape combinations are not shared between subfigures (a) and (b).
  • ...and 2 more figures