A Novel Approach to Network Traffic Analysis: the HERA tool

Daniela Pinto, Ivone Amorim, Eva Maia, Isabel Praça

TL;DR

The paper addresses the need for reliable, customizable network intrusion detection (NID) datasets by introducing HERA, an open-source pipeline that uses the Argus flow exporter to generate flow files and labelled or unlabelled datasets from PCAPs, with flexible feature sets and an integrated labelling workflow. HERA comprises four components: Workspace Definition, Flow File Generation, Dataset Creation, and Dataset Labelling. The authors validate flow generation and labelling against the UNSW-NB15 dataset and run supervised and unsupervised ML evaluations, which they report show that the generated data yield meaningful threat indicators. Key contributions include support for 130+ features, ground-truth-based labelling, and handling of both small- and large-scale datasets through ra/racluster. The tool gives researchers and practitioners an open, reproducible way to produce high-quality NID datasets and addresses limitations of CICFlowMeter and related tools.
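As a rough illustration of the Flow File Generation step, the sketch below assembles command lines for the standard Argus tools: `argus` to turn a PCAP into flow records and `ra` or `racluster` to print selected fields as CSV. It is not HERA's code. The function name `build_flow_commands`, the `aggregate` switch, and the chosen field list are assumptions made for this example, and this summary does not state when HERA picks `ra` over `racluster`.

```python
from pathlib import Path


def build_flow_commands(pcap, fields, aggregate=False):
    """Return two argument lists: PCAP -> Argus records, then records -> CSV.

    Hypothetical helper for illustration; nothing is executed here.
    Pass each list to subprocess.run to actually run it.
    """
    pcap = Path(pcap)
    argus_file = pcap.with_suffix(".argus")
    # argus: -r reads a packet capture, -w writes binary Argus flow records.
    export = ["argus", "-r", str(pcap), "-w", str(argus_file)]
    # ra prints records as stored; racluster merges records of the same flow.
    client = "racluster" if aggregate else "ra"
    # -n changes number-to-name conversion (commonly used to keep ports numeric),
    # -c sets the field separator, -s selects the output columns.
    to_csv = [client, "-r", str(argus_file), "-n", "-c", ",", "-s", *fields]
    return export, to_csv


# Assumed feature subset; HERA lets the user choose from a much larger set.
export, to_csv = build_flow_commands(
    "capture.pcap",
    ["stime", "saddr", "sport", "daddr", "dport", "proto", "dur"],
)
print(" ".join(export))
print(" ".join(to_csv))
```

Returning argument lists rather than shell strings keeps the sketch free of quoting problems and makes the feature set a plain Python list that a user-defined configuration could supply.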

Abstract

Cybersecurity threats highlight the need for robust network intrusion detection systems to identify malicious behaviour. These systems rely heavily on large datasets to train machine learning models capable of detecting patterns and predicting threats. In the past two decades, researchers have produced a multitude of datasets; however, some widely utilised recent datasets generated with CICFlowMeter contain inaccuracies. These inaccuracies result in flow generation and feature extraction inconsistencies, leading to skewed results and reduced system effectiveness. Other tools in this context lack ease of use, customizable feature sets, and flow labelling options. In this work, we introduce HERA, a new open-source tool that generates flow files and labelled or unlabelled datasets with user-defined features. Validated and tested with the UNSW-NB15 dataset, HERA demonstrated accurate flow and label generation.
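To make the idea of labelling flows from a ground-truth file concrete, here is a minimal sketch. It is not HERA's labelling algorithm: the `GroundTruthEntry` type, the `label_flow` function, the flow dictionary keys, and the matching rule (same endpoints, destination port, and protocol, with the flow start inside the attack window) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundTruthEntry:
    """One attack record from a ground-truth file (hypothetical layout)."""
    start: float  # attack start, epoch seconds
    end: float    # attack end, epoch seconds
    src: str
    dst: str
    dport: int
    proto: str
    label: str


def label_flow(flow, ground_truth, benign="Benign"):
    """Return the label of the first matching entry, or `benign` if none match."""
    key = (flow["saddr"], flow["daddr"], flow["dport"], flow["proto"])
    for gt in ground_truth:
        same_endpoints = key == (gt.src, gt.dst, gt.dport, gt.proto)
        in_window = gt.start <= flow["stime"] <= gt.end
        if same_endpoints and in_window:
            return gt.label
    return benign


# Made-up values for illustration only.
truth = [GroundTruthEntry(100.0, 200.0, "10.0.0.5", "10.0.0.9", 80, "tcp", "Exploits")]
flows = [
    {"stime": 150.0, "saddr": "10.0.0.5", "daddr": "10.0.0.9", "dport": 80, "proto": "tcp"},
    {"stime": 350.0, "saddr": "10.0.0.5", "daddr": "10.0.0.9", "dport": 80, "proto": "tcp"},
]
labels = [label_flow(f, truth) for f in flows]
print(labels)  # ['Exploits', 'Benign']
```

The linear scan is fine for small ground-truth files; for large ones, indexing the entries by endpoint tuple would avoid comparing every flow against every attack record.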

Paper Structure

This paper contains 13 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Dataset creation process.
  • Figure 2: HERA's flowchart with the main components identified.
  • Figure 3: Percentage of malicious traffic in the different versions of the dataset.
  • Figure 4: Percentage of malicious traffic in different dataset versions (k=2).