ENCODE: Encoding NetFlows for Network Anomaly Detection

Clinton Cao, Annibale Panichella, Sicco Verwer, Agathe Blaise, Filippo Rebecchi

TL;DR

ENCODE addresses the challenge of preprocessing NetFlow data for anomaly detection by encoding feature values according to their frequency and contextual co-occurrence, an idea inspired by Word2Vec. The method constructs a co-occurrence matrix, derives a contextual vector for each unique feature value, clusters the vectors with $K$-means, and uses the cluster labels as encodings. It is evaluated on a Kubernetes-based AssureMOSS dataset and two public NetFlow datasets. With unsupervised models (state machines, Isolation Forest, Local Outlier Factor, DeepLog) trained on the encoded features, ENCODE improves anomaly-detection performance, with state machines showing the strongest gains and robustness to perturbations. The work also introduces a Kubernetes NetFlow dataset and testbed for reproducibility and demonstrates the practical viability of real-time streaming extensions.

Abstract

NetFlow data is a popular network log format used by many network analysts and researchers. The advantages of using NetFlow over deep packet inspection are that it is easier to collect and process, and it is less privacy intrusive. Many works have used machine learning to detect network attacks using NetFlow data. The first step for these machine learning pipelines is to pre-process the data before it is given to the machine learning algorithm. Many approaches exist to pre-process NetFlow data; however, these simply apply existing methods to the data, not considering the specific properties of network data. We argue that for data originating from software systems, such as NetFlow or software logs, similarities in frequency and contexts of feature values are more important than similarities in the value itself. In this work, we propose an encoding algorithm that directly takes the frequency and the context of the feature values into account when the data is being processed. Different types of network behaviours can be clustered using this encoding, thus aiding the process of detecting anomalies within the network. We train several machine learning models for anomaly detection using the data that has been encoded with our encoding algorithm. We evaluate the effectiveness of our encoding on a new dataset that we created for network attacks on Kubernetes clusters and two well-known public NetFlow datasets. We empirically demonstrate that the machine learning models benefit from using our encoding for anomaly detection.
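
The encoding idea described above (count the contexts of each feature value, cluster the resulting vectors, use the cluster label as the code) can be illustrated with a minimal sketch. This is not the authors' implementation: the vector layout (counts of the directly preceding and directly following value), the deterministic $K$-means initialisation, and all function names are assumptions made for illustration only.

```python
def context_vectors(values):
    """Assumed layout: for each unique value, counts of every unique value
    seen directly before it, followed by counts seen directly after it."""
    vocab = sorted(set(values))
    index = {v: i for i, v in enumerate(vocab)}
    n = len(vocab)
    vectors = {v: [0] * (2 * n) for v in vocab}
    for pos, v in enumerate(values):
        if pos > 0:
            vectors[v][index[values[pos - 1]]] += 1
        if pos + 1 < len(values):
            vectors[v][n + index[values[pos + 1]]] += 1
    return vocab, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(points, k, max_iter=50):
    """Plain K-means; centroids start at the first k distinct points
    (an illustrative choice that keeps the sketch deterministic)."""
    centroids = []
    for p in points:
        if p not in centroids:
            centroids.append(list(p))
        if len(centroids) == k:
            break
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def encode(values, k):
    """Replace every value by the cluster label of its context vector."""
    vocab, vectors = context_vectors(values)
    labels = kmeans_labels([vectors[v] for v in vocab], k)
    mapping = dict(zip(vocab, labels))
    return [mapping[v] for v in values], mapping


# Hypothetical byte-count sequence: 5 and 6 always appear between 1 and 2,
# so they share a context vector and therefore receive the same code.
codes, mapping = encode([1, 5, 2, 1, 6, 2, 1, 5, 2, 1, 6, 2], k=3)
print(mapping)
```

In this toy sequence the values 5 and 6 differ numerically but occur in identical contexts, so they are mapped to the same code. That is the property the abstract argues for: similarity in frequency and context matters more than similarity in the value itself.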

Paper Structure (26 sections, 17 figures, 3 tables)

This paper contains 26 sections, 17 figures, 3 tables.

Figures (17)

  • Figure 1: A small subset of NetFlow data extracted from the UGR-16 dataset.
  • Figure 2: Frequencies of the directly preceding and following byte values for each unique byte value present in Figure 1.
  • Figure 3: Cluster labels assigned to the unique byte values used in the example. The assigned cluster labels are used as the encoding for the unique byte values. In this example, we have used five clusters to cluster the vectors.
  • Figure 4: Generalized vector structure used for each unique feature value within ENCODE.
  • Figure 5: High-level overview of our data processing pipeline. The pipeline shows how the encoding is created for the given NetFlow input data using our encoding algorithm.
  • ...and 12 more figures