Impacts of Data Preprocessing and Hyperparameter Optimization on the Performance of Machine Learning Models Applied to Intrusion Detection Systems
Mateus Guimarães Lima, Antony Carvalho, João Gabriel Álvares, Clayton Escouper das Chagas, Ronaldo Ribeiro Goldschmidt
TL;DR
This study systematically evaluates how data preprocessing and hyperparameter optimization affect the performance and efficiency of ML-based intrusion detection systems. By applying three experimental scenarios to two benchmark datasets and evaluating multiple classifiers, it demonstrates that carefully designed preprocessing (including outlier filtering, normalization, and feature selection) combined with grid-search hyperparameter tuning yields robust improvements in predictive metrics and substantial reductions in training and testing times. The findings offer practical guidance for deploying faster, more reliable IDS models in dynamic network environments. The work also lays groundwork for future integration with deep learning and AutoML approaches in cybersecurity contexts.
Abstract
In the cybersecurity of modern communication networks, Intrusion Detection Systems (IDS) have been continuously improved, and many now incorporate machine learning (ML) techniques to identify threats. Although prior research has studied these techniques applied to IDS, the literature lacks work focused exclusively on evaluating how data preprocessing and the optimization of ML hyperparameter values affect the construction of threat identification models. This article presents a study that fills this research gap. To that end, experiments were carried out with two datasets, comparing attack scenarios under varying preprocessing techniques and hyperparameter optimization. The results confirm that the proper application of these techniques, in general, makes the resulting classification models more robust and greatly reduces the execution times of their training and testing processes.
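
To make the kind of workflow described above concrete, the following is a minimal sketch using scikit-learn, not the authors' actual experimental code. Everything specific in it is an assumption chosen for illustration: the synthetic data standing in for a benchmark dataset, the 3-standard-deviation outlier threshold, the decision tree classifier, and the values in the hyperparameter grid. It chains outlier filtering, normalization, and feature selection with a grid search, and records training and testing times.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for labelled network traffic (assumption: NOT one of
# the paper's benchmark datasets).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Outlier filtering on the training rows only: drop any row with a feature
# more than 3 standard deviations from its mean (assumed threshold).
z_scores = np.abs((X_train - X_train.mean(axis=0)) / X_train.std(axis=0))
keep = (z_scores < 3.0).all(axis=1)
X_train_f, y_train_f = X_train[keep], y_train[keep]

# Normalization and feature selection live inside the pipeline so they are
# refit on each cross-validation fold, avoiding leakage from held-out data.
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

# Hypothetical grid; a real study would choose the ranges per classifier.
param_grid = {
    "select__k": [5, 10, 20],
    "clf__max_depth": [3, 5, 10, None],
    "clf__min_samples_leaf": [1, 5],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)

start = time.perf_counter()
search.fit(X_train_f, y_train_f)
train_seconds = time.perf_counter() - start

start = time.perf_counter()
test_f1 = search.score(X_test, y_test)  # F1 of the refit best pipeline
test_seconds = time.perf_counter() - start

print("rows kept after outlier filtering:", int(keep.sum()), "of", len(X_train))
print("best parameters:", search.best_params_)
print(f"test F1: {test_f1:.3f}")
print(f"train time: {train_seconds:.2f}s, test time: {test_seconds:.4f}s")
```

Running the same script with individual steps removed (for example, without the `select` step or with a single-point grid) gives a rough way to compare scenarios on both predictive quality and execution time, which is the kind of comparison the abstract describes.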
