A Pipeline of Augmentation and Sequence Embedding for Classification of Imbalanced Network Traffic

Matin Shokri, Ramin Hasibi

TL;DR

The paper tackles imbalanced network traffic classification by coupling an LSTM+KDE augmentation pipeline with a Flow-as-Sentence embedding (FS-Embedding) and a Transformer encoder classifier. This approach balances sparse classes while replacing sparse one-hot features with dense embeddings, enabling faster convergence and reduced parameter counts. Empirical results on a real 19-class traffic dataset show that augmentation improves learning efficiency and that FS-Embedding can achieve comparable or superior accuracy with substantially fewer parameters than one-hot baselines. Overall, the method offers a scalable, efficient strategy for robust NTC in imbalanced, real-world settings.

Abstract

Network Traffic Classification (NTC) is one of the most important tasks in network management. The imbalanced nature of traffic classes on the internet presents a critical challenge for classification. For example, some application classes, such as HTTP, are much more prevalent than others. As a result, machine learning classification models perform poorly on the classes with less data. To address this problem, we propose a pipeline that balances the dataset and classifies it using a robust and accurate embedding technique. First, we generate artificial data using Long Short-Term Memory (LSTM) networks and Kernel Density Estimation (KDE). Next, we propose replacing one-hot encoding for categorical features with a novel embedding framework based on the "Flow as a Sentence" perspective, which we name FS-Embedding. This framework treats the source and destination ports, along with the packet's direction, as one word in a flow, then learns an embedding vector space over these words while training on the classification task. Finally, we compare our pipeline with Convolutional Recurrent Neural Network (CRNN) and Transformer models trained on both imbalanced and sampled datasets, as well as with the one-hot encoding approach. We demonstrate that the proposed augmentation pipeline, combined with FS-Embedding, increases convergence speed and significantly reduces the number of model parameters while maintaining the same accuracy.
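To make the "Flow as a Sentence" idea concrete, the following is a minimal sketch of how a packet's source port, destination port, and direction might be fused into one word and mapped to a dense vector. The token format (`src_dst_direction`), the `<unk>` entry, the embedding dimension, and all function names are assumptions for illustration and are not taken from the paper. In the paper the embedding is learned jointly with the classifier; here a randomly initialised table stands in for the trained one.

```python
import random

def flow_to_words(packets):
    """Fuse (src_port, dst_port, direction) of each packet into one token.
    The underscore-joined format is an assumption, not the paper's."""
    return [f"{src}_{dst}_{direction}" for src, dst, direction in packets]

def build_vocab(flows):
    """Assign an integer id to every distinct word seen in the flows."""
    vocab = {"<unk>": 0}  # hypothetical fallback id for unseen words
    for flow in flows:
        for word in flow_to_words(flow):
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(flow, vocab):
    """Turn a flow into a sequence of word ids (a 'sentence')."""
    return [vocab.get(w, vocab["<unk>"]) for w in flow_to_words(flow)]

def init_embedding(vocab_size, dim, seed=0):
    """Random stand-in for an embedding table that would normally be trained."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(vocab_size)]

# Two toy flows; ports and directions are made up for the example.
flows = [
    [(51234, 443, "out"), (443, 51234, "in"), (51234, 443, "out")],
    [(40000, 53, "out"), (53, 40000, "in")],
]

vocab = build_vocab(flows)
table = init_embedding(len(vocab), dim=4)

ids = encode(flows[0], vocab)
vectors = [table[i] for i in ids]  # one dense vector per packet
```

Each packet contributes `dim` values to the model input regardless of vocabulary size, whereas a one-hot encoding of the same word would need one input per vocabulary entry. That gap is the source of the parameter savings the abstract refers to.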

Paper Structure

This paper contains 14 sections, 6 equations, 4 figures, 11 tables, 1 algorithm.

Figures (4)

  • Figure 1: Generating 20 time steps of packet sequence with LSTM
  • Figure 2: FS-Embedding Process
  • Figure 3: Classifier
  • Figure 4: The percentage of different classes of applications in our dataset