Evaluating ML-Based Anomaly Detection Across Datasets of Varied Integrity: A Case Study

Adrian Pekar, Richard Jozsa

TL;DR

This study examines how data integrity in network traffic datasets affects ML-based anomaly detection. It introduces two NFStream-generated refinements of CICIDS-2017 (NFS-2023-nTE and NFS-2023-TE) and compares Random Forest (RF) performance across the original dataset, the earlier refinements WTMC-2021 and CRiSIS-2022, and the two new datasets, in both binary and multi-class tasks. The results show that RF is robust to dataset imperfections, while flow expiration policies and labeling approaches shape the characteristics of the resulting data. The study also provides a fully reproducible methodology for generating and evaluating such datasets. The findings underscore the importance of methodologically sound data generation for reliable cybersecurity ML research, and the accompanying open-access resources support reproducibility and future enhancements.

Abstract

Cybersecurity remains a critical challenge in the digital age, with network traffic flow anomaly detection being a pivotal instrument in the fight against cyber threats. In this study, we address the prevalent issue of data integrity in network traffic datasets, which are instrumental in developing machine learning (ML) models for anomaly detection. We introduce two refined versions of the CICIDS-2017 dataset, NFS-2023-nTE and NFS-2023-TE, processed using NFStream to ensure methodologically sound flow expiration and labeling. Our research contrasts the performance of the Random Forest (RF) algorithm across the original CICIDS-2017, its refined counterparts WTMC-2021 and CRiSIS-2022, and our NFStream-generated datasets, in both binary and multi-class classification contexts. We observe that the RF model exhibits exceptional robustness, achieving consistently high performance metrics irrespective of the underlying dataset quality, which prompts a critical discussion on the actual impact of data integrity on ML efficacy. Our study underscores the importance of continual refinement and methodological rigor in dataset generation for network security research. As the landscape of network threats evolves, so must the tools and techniques used to detect and analyze them.

Paper Structure (40 sections, 2 figures, 10 tables, 2 algorithms)