Systematic review and characterisation of malicious industrial network traffic datasets

Martin Dobler; Michael Hellwig; Nuno Lopes; Ken Oakley; Mike Winterburn

TL;DR

Industrial OT/IIoT networks introduce cybersecurity risks alongside opportunities for standardised data access. The paper conducts a systematic review to identify publicly available industrial traffic datasets, maps 97 attacks to the Cyber Kill Chain (CKC), and assesses ML-readiness and dataset complexity using the Problexity framework. Across the ML-ready datasets it reports an average complexity score of $CS \approx 0.323$ and an average class-imbalance ratio $IR = \frac{|\text{majority class}|}{|\text{minority class}|}$ of 20.8. Feature analyses of three representative datasets (DNP3, IEC 60870-5-104, HDGM) show that features vary in their mutual information and that some may be artefacts, underscoring the need for careful feature selection and artefact-aware modelling. The study provides practical guidance for practitioners to select appropriate datasets, promotes standardisation of data formats and preprocessing, and outlines future work to broaden attack taxonomies and improve complexity metrics for robust ML-based OT/ICS intrusion detection.
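The two per-dataset quantities named above, the class-imbalance ratio and the mutual information between a feature and the class label, can be illustrated with a minimal sketch. This is not the paper's pipeline: the function names `imbalance_ratio` and `mutual_information` are hypothetical, the data is a toy example, and the mutual-information estimate assumes discrete (or already binned) feature values.

```python
from collections import Counter
from math import log2


def imbalance_ratio(labels):
    """Return IR = |majority class| / |minority class| for a label sequence."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute an imbalance ratio")
    return max(counts.values()) / min(counts.values())


def mutual_information(feature, labels):
    """Return I(X; Y) in bits for a discrete feature X and class labels Y.

    Assumption: feature values are discrete; continuous features would
    need to be binned first.
    """
    if len(feature) != len(labels):
        raise ValueError("feature and labels must have the same length")
    n = len(labels)
    joint = Counter(zip(feature, labels))  # counts of (x, y) pairs
    px = Counter(feature)                  # marginal counts of x
    py = Counter(labels)                   # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


if __name__ == "__main__":
    # Toy labels (hypothetical): 100 benign flows, 5 attack flows.
    toy_labels = ["normal"] * 100 + ["attack"] * 5
    print("IR:", imbalance_ratio(toy_labels))

    # A feature that determines the label, and one that is independent of it.
    y = ["a", "a", "b", "b"]
    print("MI (informative):", mutual_information([0, 0, 1, 1], y))
    print("MI (independent):", mutual_information([0, 1, 0, 1], y))
```

A feature with very high mutual information is not automatically a good predictor: it may be a capture artefact, for example an address or timestamp that happens to coincide with the attack window. That is the reason the summary calls for artefact-aware feature selection.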

Abstract

The adoption of the Industrial Internet of Things (IIoT) as a complementary technology to Operational Technology (OT) has enabled a new level of standardised data access and process visibility. This convergence of Information Technology (IT), OT, and IIoT has also created new cybersecurity vulnerabilities and risks that must be managed. Artificial Intelligence (AI) is emerging as a powerful tool to monitor OT/IIoT networks for malicious activity and is a highly active area of research. AI researchers are applying advanced Machine Learning (ML) and Deep Learning (DL) techniques to the detection of anomalous or malicious activity in network traffic, and they typically measure the performance of their proposed approaches on datasets derived from IoT/IIoT/OT network traffic captures. There is therefore a widespread need for datasets for algorithm testing. This work systematically reviews publicly available datasets based on network traffic captures, including a categorisation of the attack types they contain, a review of their metadata, and statistical as well as complexity analysis. Each dataset is analysed to provide researchers with metadata that helps them select, more easily, the dataset best suited to their research question and their specific Machine Learning goals.

Paper Structure (22 sections, 1 equation, 4 figures, 6 tables)

Introduction
Related Work
Methodology
Information Gathering Method
Dataset Identification
Cyber-Attack Framework Selection and Attack Mapping
Exploratory Analysis
Detailed Analysis
Results
Attack Categorisation
Complexity Analysis
Feature Analysis
Discussion
Attack Categorisation
Complexity Analysis
...and 7 more sections

Figures (4)

Figure 1: Overall methodology workflow
Figure 2: Dataset Filtering Workflow
Figure 3: Workflow of the dataset analysis leading to Table \ref{tab:cs-results}.
Figure 4: Feature importance plots for (a) high complexity dataset HDGM (Complexity Score 0.479), (b) medium complexity dataset IEC 60870-5-104 (Complexity Score 0.274), and (c) low complexity dataset DNP3 (Complexity Score 0.075).
