Enhancing web traffic attacks identification through ensemble methods and feature selection

Daniel Urda, Branly Martínez, Nuño Basurto, Meelis Kull, Ángel Arroyo, Álvaro Herrero

TL;DR

The paper addresses web-traffic attack detection in high-volume websites using machine learning (ML) on HTTP traces. It proposes a feature-extraction pipeline applied to the CSIC2010 v2 dataset and evaluates both baseline classifiers and ensemble methods, with three feature-selection strategies. Ensemble methods (Random Forest and XGBoost) significantly outperform baselines, achieving an average AUC improvement of about 20% and reaching AUC values as high as 0.989 for XGBoost; feature selection generally yields limited gains in AUC but reduces the feature count considerably. The work provides a practical framework for deploying robust web-traffic intrusion detectors and offers insights into when feature selection is advantageous, with future extensions including encrypted traffic contexts, real-time deployment, and broader datasets.
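Since the headline results are reported as AUC, a small worked example of the metric may help. The sketch below uses the standard rank interpretation of the area under the ROC curve: the probability that a randomly chosen attack request receives a higher score than a randomly chosen normal request, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not taken from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0 (normal) or 1 (attack); scores: classifier scores,
    higher meaning "more likely an attack". Both are toy inputs here.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 (attack, normal) pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported 0.989 should be read. The quadratic pairwise loop is chosen for clarity; it is not meant for large datasets.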

Abstract

Websites, as essential digital assets, are highly vulnerable to cyberattacks because of their high traffic volume and the significant impact of breaches. This study aims to enhance the identification of web traffic attacks by leveraging machine learning techniques. A methodology was proposed to extract relevant features from HTTP traces using the CSIC2010 v2 dataset, which simulates e-commerce web traffic. Ensemble methods, such as Random Forest and Extreme Gradient Boosting, were employed and compared against baseline classifiers, including k-Nearest Neighbors, LASSO, and Support Vector Machines. The results demonstrate that the ensemble methods outperform baseline classifiers by approximately 20% in predictive performance, achieving an Area Under the ROC Curve (AUC) of 0.989. Feature selection methods such as Information Gain, LASSO, and Random Forest further enhance the robustness of these models. This study highlights the efficacy of ensemble models in improving attack detection while minimizing performance variability, offering a practical framework for securing web traffic in diverse application contexts.
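To illustrate one of the feature-selection criteria named in the abstract, the following minimal sketch ranks features by Information Gain: the label entropy minus the expected label entropy after splitting on a feature. It assumes discrete feature values. The feature names (`has_quote`, `is_get`) and the toy data are hypothetical and are not the features extracted in the paper, whose exact pipeline is not described in this excerpt.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - sum_v P(X = v) * H(Y | X = v) for a discrete feature."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: 1 = attack, 0 = normal request.
labels = [0, 0, 1, 1]
features = {
    "has_quote": [0, 0, 1, 1],  # perfectly aligned with the label
    "is_get":    [0, 1, 0, 1],  # carries no information about the label
}

# Rank features from most to least informative.
ranking = sorted(features, key=lambda f: information_gain(features[f], labels), reverse=True)
print(ranking)  # ['has_quote', 'is_get']
```

Keeping only the top-ranked features is what reduces the feature count; continuous features would first need to be discretized (binned) for this formulation to apply.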

Paper Structure

This paper contains 11 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 3: Average AUC performance of all classifiers and feature selection methods analyzed.
  • Figure 4: Average performance of all metrics from different perspectives.