Enhancing web traffic attacks identification through ensemble methods and feature selection
Daniel Urda, Branly Martínez, Nuño Basurto, Meelis Kull, Ángel Arroyo, Álvaro Herrero
TL;DR
The paper addresses web-traffic attack detection in high-volume websites using machine learning on HTTP traces. It proposes a feature-extraction pipeline applied to the CSIC2010 v2 dataset and evaluates both baseline classifiers and ensemble methods, with three feature-selection strategies. Ensemble methods (Random Forest and XGBoost) significantly outperform the baselines, achieving an average AUC improvement of about 20% and reaching AUC values as high as 0.989 for XGBoost; feature selection generally yields limited gains in AUC but reduces the feature count considerably. The work provides a practical framework for deploying robust web-traffic intrusion detectors and offers insights into when feature selection is advantageous, with future extensions including encrypted traffic contexts, real-time deployment, and broader datasets.
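As a rough illustration of what extracting features from an HTTP trace can look like, the sketch below turns one request into a small numeric feature vector. The function name `extract_features`, the chosen features, and the `SUSPICIOUS_CHARS` set are all hypothetical and are not taken from the paper, whose actual feature set is not described in this summary.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical set of characters that often appear in injection payloads
# (an assumption for illustration, not the paper's definition).
SUSPICIOUS_CHARS = "'\"<>;%()"


def extract_features(method, url, body=""):
    """Map one HTTP request to a dict of simple numeric features."""
    parts = urlsplit(url)
    # Parameters can arrive in the query string (GET) or in the body (POST).
    params = parse_qsl(parts.query, keep_blank_values=True)
    params += parse_qsl(body, keep_blank_values=True)
    payload = parts.query + body  # raw, still percent-encoded
    return {
        "is_post": int(method.upper() == "POST"),
        "path_length": len(parts.path),
        "n_params": len(params),
        "payload_length": len(payload),
        "n_suspicious": sum(payload.count(c) for c in SUSPICIOUS_CHARS),
    }


normal = extract_features(
    "GET", "http://localhost:8080/tienda1/index.jsp?id=2&nombre=abc"
)
attack = extract_features(
    "GET", "http://localhost:8080/tienda1/index.jsp?id=2%27+OR+%271%27%3D%271"
)
```

Vectors of this kind, one per request, are what a classifier would be trained on; a real pipeline would likely use a richer and more carefully validated feature set.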
Abstract
Websites, as essential digital assets, are highly vulnerable to cyberattacks because of their high traffic volume and the significant impact of breaches. This study aims to enhance the identification of web traffic attacks by leveraging machine learning techniques. A methodology is proposed to extract relevant features from HTTP traces using the CSIC2010 v2 dataset, which simulates e-commerce web traffic. Ensemble methods, namely Random Forest and Extreme Gradient Boosting, were employed and compared against baseline classifiers, including k-Nearest Neighbors, LASSO, and Support Vector Machines. The results demonstrate that the ensemble methods outperform the baseline classifiers by approximately 20% in predictive performance, achieving an Area Under the ROC Curve (AUC) of up to 0.989. Feature selection methods such as Information Gain, LASSO, and Random Forest further enhance the robustness of these models. This study highlights the efficacy of ensemble models in improving attack detection while minimizing performance variability, offering a practical framework for securing web traffic in diverse application contexts.
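Since the results are reported as AUC, it may help to recall what that number measures: the probability that a randomly chosen attack request receives a higher score than a randomly chosen normal request, with ties counted as one half. The following minimal sketch computes it directly from that definition; the function name `roc_auc` and the toy scores are illustrative assumptions, not the evaluation code used in the study.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0 (normal) or 1 (attack)
    scores: classifier scores, higher meaning "more likely an attack"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: three of the four attack/normal pairs are ranked correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The quadratic loop keeps the definition visible; for large test sets a rank-based implementation, such as the one in a standard machine learning library, would be the practical choice.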
