Capturing the security expert knowledge in feature selection for web application attack detection

Amanda Riverol, Gustavo Betarte, Rodrigo Martínez, Álvaro Pardo

TL;DR

This work addresses false positives in web application firewalls (WAFs) with a data-driven feature selection strategy that uses mutual information (MI) to prioritize tokens that distinguish benign from malicious HTTP requests. The authors train a One-Class SVM in a semi-supervised setting on a Bag-of-Words representation built from a diverse dictionary of attack and normal traffic, with MI guiding feature selection on TF-IDF features. On the Drupal and SR-BH 2020 datasets, the proposed 100-feature MI-based approach achieves high attack detection with substantially fewer false positives than ModSecurity baselines and outperforms expert-driven feature sets on several metrics, while avoiding heavy dependence on labeled attack data. The study presents a reproducible pipeline for improving WAF effectiveness, with practical implications for reducing manual expert tuning and improving generalization to evolving attack patterns.

Abstract

This article puts forward the use of mutual information values to replicate the expertise of security professionals in selecting features for detecting web attacks. The goal is to enhance the effectiveness of web application firewalls (WAFs). Web applications are frequently vulnerable to various security threats, making WAFs essential for their protection. WAFs analyze HTTP traffic using rule-based approaches to identify known attack patterns and to detect and block potentially malicious requests. However, a major challenge is the occurrence of false positives, which can block legitimate traffic and disrupt the normal functioning of the application. The problem is addressed with an approach that combines supervised learning for feature selection with a semi-supervised learning scenario for training a One-Class SVM model. The experimental findings show that the model trained with features selected by the proposed algorithm outperformed the model trained with the expert-based selection. Additionally, the approach improved on the results obtained by the traditional rule-based WAF ModSecurity, configured with a vanilla set of OWASP CRS rules.
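
To make the feature selection idea concrete, the following is a minimal sketch of ranking tokens by their mutual information with the benign/attack label and keeping the top k. It is not the authors' implementation: the tokenizer, the toy requests, the helper names (tokenize, mutual_information, select_top_k), and the use of binary token presence instead of TF-IDF values are all assumptions made for illustration. The One-Class SVM training step is omitted.

```python
import math
import re
from collections import Counter


def tokenize(request):
    """Hypothetical tokenizer: lowercase words plus single punctuation marks."""
    return set(re.findall(r"[a-z]+|[^a-z0-9\s]", request.lower()))


def mutual_information(docs, labels, token):
    """MI in bits between presence of `token` (X) and the class label (Y).

    I(X; Y) = sum over (x, y) of p(x, y) * log2(p(x, y) / (p(x) * p(y)))
    """
    n = len(docs)
    joint = Counter((token in doc, label) for doc, label in zip(docs, labels))
    px, py = Counter(), Counter()
    for (x, y), count in joint.items():
        px[x] += count
        py[y] += count
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def select_top_k(docs, labels, k):
    """Return the k tokens with the highest MI; ties broken alphabetically."""
    vocab = sorted(set().union(*docs))
    scored = [(mutual_information(docs, labels, t), t) for t in vocab]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [token for _, token in scored[:k]]


# Toy labeled requests (assumed data): 0 = benign, 1 = attack.
requests = [
    ("GET /index.php?id=1", 0),
    ("GET /search?q=shoes", 0),
    ("GET /index.php?id=1 UNION SELECT password FROM users", 1),
    ("GET /search?q=' UNION SELECT 1 --", 1),
]
docs = [tokenize(text) for text, _ in requests]
labels = [label for _, label in requests]

if __name__ == "__main__":
    print(select_top_k(docs, labels, 2))
```

In this toy setting, tokens that appear in every request (such as the HTTP method) carry no information about the label, while tokens that appear only in attacks score highest. Labels are needed only for this selection step; in the semi-supervised scenario described above, the One-Class SVM itself would then be trained on the selected features.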

Paper Structure (17 sections, 5 equations, 8 figures, 2 tables, 3 algorithms)

Figures (8)

  • Figure 1: Valid Request
  • Figure 2: Attack Request (SQL Injection)
  • Figure 3: SR-BH 2020 Attack Distribution [riera2022new]
  • Figure 4: PKDD Attack Distribution [gallagher2009classification]
  • Figure 5: Top 50 Feature Selection Drupal
  • ...and 3 more figures