NetFlowGen: Leveraging Generative Pre-training for Network Traffic Dynamics

Jiawei Zhou; Woojeong Kim; Zhiying Xu; Alexander M. Rush; Minlan Yu

TL;DR

NetFlowGen tackles label scarcity in network traffic analytics by pre-training a decoder Transformer on unlabeled NetFlow records to learn general traffic dynamics. It discretizes and embeds heterogeneous NetFlow features into a unified representation, which supports downstream fine-tuning for tasks such as early DDoS detection with minimal labels. The approach shows improved next-step traffic prediction and robust downstream performance, including on unseen nodes and across varied attack types, highlighting the practicality of network foundation models. The work also outlines concrete directions for richer feature representations, topology-aware modeling, and scaling foundation models in networking contexts.

Abstract

Understanding the traffic dynamics in networks is a core capability for automated systems that monitor and analyze networking behaviors, reducing expensive human effort and economic risk through tasks such as traffic classification, congestion prediction, and attack detection. However, it remains challenging to model network traffic accurately with machine learning approaches in an efficient and broadly applicable manner. Different networking applications typically rely on task-specific models trained from scratch, which limits both the efficiency of model development and the generalization of deployed models. Furthermore, while networking data is abundant, high-quality task-specific labels are often insufficient for training individual models. Large-scale self-supervised learning on unlabeled data provides a natural pathway for tackling these challenges. We propose to pre-train a general-purpose machine learning model to capture traffic dynamics using only traffic data from NetFlow records, with the goal of fine-tuning it for different downstream tasks with a small amount of labels. Our NetFlowGen framework goes beyond a proof-of-concept for network traffic pre-training and addresses specific challenges such as unifying network feature representations, learning from large volumes of unlabeled traffic data, and testing on a real downstream task, DDoS attack detection. Experiments demonstrate promising results for our pre-training framework in capturing traffic dynamics and adapting to different networking tasks.

Paper Structure (29 sections, 1 equation, 6 figures, 10 tables, 1 algorithm)

Introduction
Challenges and opportunities
Method
Pre-training
Feature Discretization
Feature Embedding
Fine-tuning: DDoS Attack Detection
Dataset Construction
NetFlow Data Serialization
NetFlow Data Filtering
Downstream Task: Early DDoS Attack Detection
Experimental Setup
Model & Data Configuration
Evaluation
Model Implementation
...and 14 more sections

Figures (6)

Figure 1: A desirable vision of a network foundation model that captures comprehensive traffic dynamics and the same model can be adapted to various downstream networking tasks.
Figure 2: NetFlowGen generative pre-training framework. The framework consists of two parts: generative pre-training and fine-tuning. The objective of pre-training is to predict the traffic at time-step $T$ given the history up to time-step $T-1$. As Transformer models work well when predicting discrete tokens, we transform raw NetFlow traffic into discrete values and use them as model input and target output. The feature representation process is illustrated in more detail in Figure 3.
Figure 3: Feature representation process of NetFlowGen. We employ two different embedding methods. For inherently continuous features like traffic and time features, we discretize them before transforming them into a consistent fixed-size continuous vector. On the other hand, discrete metadata, such as node ID and customer ID, leverages conventional embedding methods by performing look-ups from trainable embedding tables.
Figure 4: Late detection
Figure 5: Early detection
...and 1 more figure
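
The captions of Figures 2 and 3 describe three steps: discretizing continuous traffic features, looking up embeddings for both the resulting bins and the discrete metadata, and training on next-step prediction. The Python sketch below illustrates these steps under stated assumptions and is not the authors' implementation. The log-spaced binning, the summation of embeddings, and all names and constants (`NUM_BINS`, `EMBED_DIM`, `MAX_VALUE`, `discretize`, `embed_record`, `next_step_pairs`) are hypothetical choices made for illustration; the source does not specify them.

```python
import math
import random

NUM_BINS = 16      # hypothetical number of discretization bins
EMBED_DIM = 8      # hypothetical embedding size
MAX_VALUE = 1e9    # hypothetical upper bound on a traffic counter


def discretize(value, num_bins=NUM_BINS, max_value=MAX_VALUE):
    """Map a non-negative continuous value to a bin index.

    Log-spaced bins are an assumption, chosen because traffic
    volumes tend to span several orders of magnitude.
    """
    if value < 0:
        raise ValueError("traffic values are expected to be non-negative")
    scaled = math.log1p(value) / math.log1p(max_value)  # about [0, 1]
    return min(int(scaled * num_bins), num_bins - 1)    # clamp overflow


def make_table(num_rows, dim, rng):
    """Random embedding table; in a real model these would be trainable."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_rows)]


def embed_record(traffic_bytes, node_id, traffic_table, node_table):
    """Embed one record as a fixed-size vector.

    The continuous feature is binned and then looked up; the discrete
    metadata is looked up directly. Summing the two is an assumption.
    """
    traffic_vec = traffic_table[discretize(traffic_bytes)]
    node_vec = node_table[node_id]
    return [t + n for t, n in zip(traffic_vec, node_vec)]


def next_step_pairs(bin_ids):
    """Build (input, target) sequences where the target is shifted by one step."""
    return bin_ids[:-1], bin_ids[1:]


rng = random.Random(0)
traffic_table = make_table(NUM_BINS, EMBED_DIM, rng)
node_table = make_table(4, EMBED_DIM, rng)  # 4 hypothetical node IDs

vector = embed_record(12345.0, 2, traffic_table, node_table)

# A toy traffic series turned into discrete tokens, then into training pairs.
series = [0.0, 150.0, 12345.0, 2.5e6]
tokens = [discretize(v) for v in series]
inputs, targets = next_step_pairs(tokens)
```

Here `inputs` holds the tokens for steps up to $T-1$ and `targets` holds the tokens for the following steps, which mirrors the next-step objective in the Figure 2 caption. A real model would replace the random tables with learned parameters and feed the embedded sequence to a decoder Transformer.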
