A Pipeline of Augmentation and Sequence Embedding for Classification of Imbalanced Network Traffic
Matin Shokri, Ramin Hasibi
TL;DR
The paper tackles imbalanced network traffic classification (NTC) by coupling an LSTM+KDE augmentation pipeline with a Flow-as-Sentence embedding (FS-Embedding) and a Transformer encoder classifier. The augmentation balances under-represented classes, while FS-Embedding replaces sparse one-hot features with dense embeddings, enabling faster convergence and lower parameter counts. Empirical results on a real 19-class traffic dataset show that augmentation improves learning efficiency and that FS-Embedding can achieve comparable or superior accuracy with substantially fewer parameters than one-hot baselines. Overall, the method offers a scalable, efficient strategy for robust NTC in imbalanced, real-world settings.
Abstract
Network Traffic Classification (NTC) is one of the most important tasks in network management. The imbalanced nature of classes on the internet presents a critical challenge in classification tasks. For example, some application classes, such as HTTP, are much more prevalent than others. As a result, machine learning classification models do not perform well on the classes with less data. To address this problem, we propose a pipeline to balance the dataset and classify it using a robust and accurate embedding technique. First, we generate artificial data using Long Short-Term Memory (LSTM) networks and Kernel Density Estimation (KDE). Next, we propose replacing one-hot encoding for categorical features with a novel embedding framework based on the "Flow as a Sentence" perspective, which we name FS-Embedding. This framework treats the source and destination ports, along with the packet's direction, as one word in a flow, then trains an embedding vector space over these words while learning the classification task. Finally, we compare our pipeline with training a Convolutional Recurrent Neural Network (CRNN) and Transformers on both the imbalanced and the sampled datasets, as well as with the one-hot encoding approach. We demonstrate that the proposed augmentation pipeline, combined with FS-Embedding, increases convergence speed and leads to a significant reduction in the number of model parameters, all while maintaining the same performance in terms of accuracy.
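To make the "Flow as a Sentence" idea concrete, the following is a minimal sketch of how a flow might be turned into a sequence of words and looked up in a dense embedding table. It is an illustration under stated assumptions, not the authors' implementation: the packet field names (`src_port`, `dst_port`, `direction`), the 0/1 direction encoding, the underscore-joined word format, the `<unk>` token, and all function names are hypothetical. The embedding table here is only randomly initialized; in the described pipeline the vectors would be trained jointly with the classifier, which this sketch omits.

```python
import random

UNK = "<unk>"  # hypothetical placeholder for words unseen when the vocabulary was built


def flow_to_words(packets):
    """Turn each packet into one 'word' built from (src_port, dst_port, direction)."""
    return [f"{p['src_port']}_{p['dst_port']}_{p['direction']}" for p in packets]


def build_vocab(flows):
    """Assign an integer index to every distinct word seen in the given flows."""
    vocab = {UNK: 0}
    for flow in flows:
        for word in flow_to_words(flow):
            vocab.setdefault(word, len(vocab))
    return vocab


def init_embeddings(vocab_size, dim, seed=0):
    """Randomly initialized lookup table; training it is outside this sketch."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def embed_flow(packets, vocab, table):
    """Map a flow (sentence) to a sequence of dense vectors, one per packet (word)."""
    return [table[vocab.get(word, vocab[UNK])] for word in flow_to_words(packets)]


# A toy two-packet flow; direction 0 = client-to-server, 1 = server-to-client (assumed).
flow = [
    {"src_port": 51234, "dst_port": 443, "direction": 0},
    {"src_port": 443, "dst_port": 51234, "direction": 1},
]
vocab = build_vocab([flow])
table = init_embeddings(len(vocab), dim=4)
vectors = embed_flow(flow, vocab, table)
```

With this representation, each packet contributes a single `dim`-sized vector rather than separate one-hot vectors over the port and direction value ranges, which is the intuition behind the reduced input width described above.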
