Expectations Versus Reality: Evaluating Intrusion Detection Systems in Practice

Jake Hesford; Daniel Cheng; Alan Wan; Larry Huynh; Seungho Kim; Hyoungshick Kim; Jin B. Hong

TL;DR

The paper addresses the difficulty of comparing intrusion detection systems (IDSs) across heterogeneous network environments and datasets, and shows that no single IDS dominates everywhere because datasets and environments differ. It introduces a standardized evaluation pipeline and benchmarks four IDSs (Kitsune, HELAD, a deep neural network-based IDS, and Slip) on six datasets (CICIDS2017, UNSW-NB15, Stratosphere IoT, BoT-IoT, Mirai, ToN-IoT). The deep neural network IDS (sdnn) achieves the highest average F1 score overall, but its performance varies by dataset and is notably poorer on Stratosphere IoT. IoT-focused datasets favor Kitsune, while HELAD is highly accurate on some datasets but has less consistent recall. The findings highlight cross-dataset variability and practical obstacles to IDS deployment, and argue for standardized evaluation, richer and more diverse datasets, and virtualization to bridge research and real-world practice.

Abstract

Our paper empirically compares recent IDSs to give users an objective basis for choosing the solution that best fits their requirements. Our results show that no single solution is best; performance depends on external variables such as the types of attacks, complexity, and network environment in the dataset. For example, the BoT-IoT and Stratosphere IoT datasets both capture IoT-related attacks, but the deep neural network performed best on BoT-IoT while HELAD performed best on Stratosphere IoT. So although the deep neural network solution had the highest average F1 score across the tested datasets, it is not always the best performer. We further discuss the difficulties of using IDSs from the literature and project repositories, which complicated drawing definitive conclusions about IDS selection.
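
The gap between "highest average F1" and "best on every dataset" can be made concrete with a minimal sketch. Everything in it is an assumption for illustration: the names ids_a, ids_b, dataset_x, and dataset_y are hypothetical placeholders, and the true-positive, false-positive, and false-negative counts are invented, not results from the paper. It only shows how a per-dataset F1 table and its average can point to different winners.

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical (tp, fp, fn) counts per IDS and dataset.
# These are made-up numbers, NOT measurements from the paper.
counts = {
    "ids_a": {"dataset_x": (90, 10, 10), "dataset_y": (40, 30, 30)},
    "ids_b": {"dataset_x": (70, 20, 40), "dataset_y": (60, 20, 20)},
}

# F1 for every (IDS, dataset) pair.
scores = {
    ids: {ds: f1_score(*c) for ds, c in per_ds.items()}
    for ids, per_ds in counts.items()
}

# Unweighted mean F1 across datasets for each IDS.
average = {ids: sum(s.values()) / len(s) for ids, s in scores.items()}
best_on_average = max(average, key=average.get)

# Best IDS on each individual dataset.
datasets = sorted({ds for per_ds in scores.values() for ds in per_ds})
best_by_dataset = {
    ds: max(scores, key=lambda ids: scores[ids][ds]) for ds in datasets
}

if __name__ == "__main__":
    for ids, s in scores.items():
        print(ids, {ds: round(v, 3) for ds, v in s.items()}, round(average[ids], 3))
    print("best on average:", best_on_average)
    print("best per dataset:", best_by_dataset)
```

With these placeholder counts, ids_a has the higher average F1 yet ids_b scores higher on dataset_y, which mirrors the kind of disagreement the abstract describes between average and per-dataset rankings.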
