ModSec-Learn: Boosting ModSecurity with Machine Learning

Christian Scano; Giuseppe Floris; Biagio Montaruli; Luca Demetrio; Andrea Valenza; Luca Compagna; Davide Ariu; Luca Piras; Davide Balzarotti; Battista Biggio

TL;DR

This work addresses the mismatch between traditional ModSecurity CRS rule weights and real-world web traffic by introducing ModSec-Learn, which treats CRS rules as $52$ binary features and learns per-rule weights with models such as SVM, Logistic Regression, and Random Forest. By replacing heuristic severities with data-driven weights and applying sparse regularization, ModSec-Learn achieves a substantial improvement in the true positive rate at a fixed low false-positive rate ($>45\%$ at $1\%$ FPR) and can discard a significant fraction of rules (roughly $30\%$) without sacrificing performance. The authors validate their approach on a new dataset built from legitimate traffic and diverse SQLi payloads, and show consistent gains across PL configurations, with $\ell_1$ regularization enabling automatic rule selection (e.g., $18$ zero-weight rules). The work demonstrates a practical path to augment traditional WAFs with machine learning, potentially generalizing to other threats and rule-sets, and provides open-source code and datasets to foster further research.
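The headline metric, the true positive rate at a fixed false positive rate, can be made concrete with a short sketch. The function name `tpr_at_fpr`, the tie-handling convention, and the toy scores below are illustrative assumptions, not the authors' evaluation code.

```python
import math
from typing import Sequence


def tpr_at_fpr(benign_scores: Sequence[float],
               attack_scores: Sequence[float],
               max_fpr: float) -> float:
    """Detection rate at the strictest threshold that respects an FPR budget.

    A request is flagged when its score is strictly greater than the threshold.
    """
    n_benign = len(benign_scores)
    # Number of benign requests we are allowed to flag by mistake.
    # The small epsilon guards against floating-point rounding of the product.
    allowed_fp = math.floor(max_fpr * n_benign + 1e-9)
    ranked = sorted(benign_scores, reverse=True)
    # Everything scoring above the (allowed_fp + 1)-th highest benign score is
    # flagged, so at most allowed_fp benign requests are misclassified.
    threshold = ranked[allowed_fp] if allowed_fp < n_benign else -math.inf
    detected = sum(1 for s in attack_scores if s > threshold)
    return detected / len(attack_scores)


# Hypothetical scores: ten benign requests and four attacks.
benign = [0, 0, 0, 0, 0, 0, 0, 0, 3, 5]
attacks = [2, 5, 8, 10]
rate = tpr_at_fpr(benign, attacks, max_fpr=0.1)
print(rate)
```

Benign scores that tie with the threshold are not flagged in this sketch, so the realized FPR can fall below the budget.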

Abstract

ModSecurity is widely recognized as the standard open-source Web Application Firewall (WAF), maintained by the OWASP Foundation. It detects malicious requests by matching them against the Core Rule Set (CRS), identifying well-known attack patterns. Each rule is manually assigned a weight based on the severity of the corresponding attack, and a request is blocked if the sum of the weights of matched rules exceeds a given threshold. However, we argue that this strategy is largely ineffective against web attacks, as detection is only based on heuristics and not customized on the application to protect. In this work, we overcome this issue by proposing a machine-learning model that uses the CRS rules as input features. Through training, ModSec-Learn is able to tune the contribution of each CRS rule to predictions, thus adapting the severity level to the web applications to protect. Our experiments show that ModSec-Learn achieves a significantly better trade-off between detection and false positive rates. Finally, we analyze how sparse regularization can reduce the number of rules that are relevant at inference time, by discarding more than 30% of the CRS rules. We release our open-source code and the dataset at https://github.com/pralab/modsec-learn and https://github.com/pralab/http-traffic-dataset, respectively.
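The two decision rules contrasted in the abstract can be sketched as follows. This is a minimal illustration only: the four-rule setup, the severity values, the threshold of 5, the inclusive comparison, and the learned weights and bias are all hypothetical placeholders rather than values taken from the CRS or from the paper.

```python
from typing import Sequence


def vanilla_score(matched: Sequence[int], severities: Sequence[int]) -> int:
    """Sum the hand-assigned severity of every rule that matched."""
    return sum(s for m, s in zip(matched, severities) if m)


def vanilla_blocks(matched: Sequence[int], severities: Sequence[int],
                   threshold: int = 5) -> bool:
    # Assumption: a request is blocked once the score reaches the threshold.
    return vanilla_score(matched, severities) >= threshold


def learned_score(matched: Sequence[int], weights: Sequence[float],
                  bias: float) -> float:
    """Linear model over the same binary rule-match features."""
    return sum(w * m for m, w in zip(matched, weights)) + bias


def learned_blocks(matched: Sequence[int], weights: Sequence[float],
                   bias: float) -> bool:
    return learned_score(matched, weights, bias) > 0.0


# Hypothetical request that triggers rules 0 and 2 out of four rules.
matched = [1, 0, 1, 0]
severities = [5, 5, 3, 2]          # hand-assigned, identical for every application
weights = [0.2, 1.5, -0.4, 0.0]    # placeholder values standing in for trained weights
bias = -0.5

print(vanilla_score(matched, severities), vanilla_blocks(matched, severities))
print(learned_blocks(matched, weights, bias))
```

With these placeholder numbers the fixed severities block the request, while the learned weights, which down-weight rules that also fire on legitimate traffic, let it through. The feature vector is the same in both cases; only the per-rule contribution changes.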

Paper Structure (11 sections, 3 figures, 1 table)

Introduction
Background
Improving ModSecurity with Machine Learning
Experimental Analysis
Experimental Setup
Evaluation of ModSecurity
Evaluation of ModSec-Learn
Imposing Sparsity through Regularization
Related Work
Conclusion and Future Work
Acknowledgments

Figures (3)

Figure 1: ModSec-Learn architecture. A machine-learning model is trained using the CRS rules as input features (52 features) to improve the trade-off between detection rate and false alarms. This amounts to learning a model of the incoming traffic directed towards the protected web services. Sparse regularization can also be used to select a subset of the available rules, instead of using PLs.
Figure 2: ROC curves of ModSecurity vanilla (ModSec) and ModSec-Learn (SVM, RF, and LR), evaluated on the test set. Each curve reports the average detection rate of SQLi attacks (i.e., the True Positive Rate) against the fraction of misclassified benign SQL queries (i.e., the False Positive Rate). The zoomed section helps to understand the performance of each model when lines overlap.
Figure 3: Weight values learned at PL 4 by ModSec-Learn LR - $\ell_1$ (blue) and ModSec-Learn LR - $\ell_2$ (light red), and the weight used by ModSecurity vanilla (green). The additional color, i.e., red, is given by the overlapping of the green and blue bars with the light red ones. We only report the last three digits of the rule IDs on the x-axis as the first three digits are equal to 942 for all rules.
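The rule-selection effect of $\ell_1$ regularization shown in Figure 3 can be reproduced on a toy problem. The sketch below trains a logistic regression with proximal gradient descent (soft-thresholding) in plain Python; it is a from-scratch illustration, not the authors' training code, and the three-rule dataset, the function names, and the hyperparameters `lam`, `lr`, and `steps` are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(v: float, t: float) -> float:
    """Proximal operator of the l1 penalty: shrink toward zero, clip at zero."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logreg(X: Sequence[Sequence[int]], y: Sequence[int],
                  lam: float = 0.1, lr: float = 0.5,
                  steps: int = 2000) -> Tuple[List[float], float]:
    """Minimize mean log-loss + lam * ||w||_1 (the bias is not penalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        # Prediction error p_i - y_i for every sample.
        errs = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - yi
                for x, yi in zip(X, y)]
        grad_w = [sum(e * x[j] for e, x in zip(errs, X)) / n for j in range(d)]
        grad_b = sum(errs) / n
        # Gradient step on the smooth loss, then soft-threshold each weight.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Hypothetical rule-match matrix (rows: requests, columns: three rules).
# Rule 0 fires only on attacks, rule 1 never fires,
# rule 2 fires equally often on attacks and on benign requests.
X = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 0, 0]]
y = [1, 1, 0, 0]

w, b = fit_l1_logreg(X, y)
kept = [j for j, wj in enumerate(w) if wj != 0.0]
print(w, kept)
```

Rules whose weight ends at exactly zero contribute nothing to the decision, so under this kind of model they would not need to be evaluated at inference time.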
