RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale

Ayush Garg, Sophia Hager, Jacob Montiel, Aditya Tiwari, Michael Gentile, Zach Reavis, David Magnotti, Wayne Fullen

Abstract

Security teams face a scaling challenge: the volume of newly disclosed Common Vulnerabilities and Exposures (CVEs) far exceeds their capacity to manually develop detection mechanisms. In 2025, the National Vulnerability Database published over 48,000 new vulnerabilities, motivating the need for automation. We present RuleForge, an internal AWS system that automatically generates detection rules--JSON-based patterns that identify malicious HTTP requests exploiting specific vulnerabilities--from structured Nuclei templates describing CVE details. Nuclei templates provide standardized, YAML-based vulnerability descriptions that serve as the structured input to our rule generation process. This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic feedback integration mechanism. This validation approach evaluates candidate rules along two dimensions--sensitivity (avoiding false negatives) and specificity (avoiding false positives)--achieving an AUROC of 0.75 and reducing production false positives by 67% compared to synthetic-test-only validation. Our 5×5 generation strategy (five parallel candidates, each with up to five refinement attempts) combined with continuous feedback loops enables systematic quality improvement. We also present extensions enabling rule generation from unstructured data sources and demonstrate a proof-of-concept agentic workflow for multi-event-type detection. Our lessons learned highlight critical considerations for applying LLMs to cybersecurity tasks, including overconfidence mitigation and the importance of domain expertise in both prompt design and human-in-the-loop quality review of generated rules.
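
To make the 5×5 strategy concrete, the following minimal Python sketch shows the generate-judge-refine loop under stated assumptions: generate_rule, judge_confidence, and refine_rule are hypothetical stand-ins for RuleForge's LLM calls, and the 0.8 acceptance threshold is illustrative, not the system's actual cutoff.

```python
import random

NUM_CANDIDATES = 5      # five parallel candidates
MAX_REFINEMENTS = 5     # up to five refinement attempts per candidate
ACCEPT_THRESHOLD = 0.8  # illustrative judge-confidence cutoff (assumption)

# Placeholder stand-ins for RuleForge's LLM calls; the real system
# prompts an LLM with a Nuclei template and with judge feedback.
def generate_rule(template: dict) -> dict:
    return {"cve": template["cve"], "pattern": "<http-request-pattern>"}

def judge_confidence(rule: dict) -> dict:
    # LLM-as-a-judge scores on the paper's two dimensions.
    return {"sensitivity": random.random(), "specificity": random.random()}

def refine_rule(rule: dict, feedback: dict) -> dict:
    return dict(rule)  # a real refinement would rewrite the pattern

def generate_with_refinement(template: dict):
    """One candidate: draft, then refine until the judge's weaker
    dimension clears the threshold, or give up after five attempts."""
    rule = generate_rule(template)
    for _ in range(MAX_REFINEMENTS):
        scores = judge_confidence(rule)
        if min(scores.values()) >= ACCEPT_THRESHOLD:
            return {"rule": rule, "scores": scores}
        rule = refine_rule(rule, feedback=scores)  # feedback integration
    return None

def best_rule(template: dict):
    """Five candidates (conceptually parallel); keep the one whose
    weakest dimension is strongest."""
    candidates = [generate_with_refinement(template) for _ in range(NUM_CANDIDATES)]
    survivors = [c for c in candidates if c is not None]
    return max(survivors, key=lambda c: min(c["scores"].values())) if survivors else None

print(best_rule({"cve": "CVE-2025-0001"}))
```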

Paper Structure

This paper contains 35 sections and 10 figures.

Figures (10)

  • Figure 1: RuleForge architecture showing CVE Repository, Rule Generation Engine, Validation Pipeline, and Feedback Integration components.
  • Figure 2: 5×5 Generation Strategy showing five parallel rule candidates with confidence scores and iterative refinement process. The system generates multiple candidates simultaneously and selects the best performer based on validation results.
  • Figure 3: Systematic Feedback Loop mechanism showing how validation failures generate specific feedback that is integrated into subsequent rule generation attempts. Examples include feedback about missed patterns, overly broad rules, and specificity issues.
  • Figure 4: CVE Rule Selection Process Flow. The system processes CVEs without existing rules through three parallel analysis components: content-based keyword matching, CISA KEV status verification, and cybersecurity news feed presence checking. These inputs feed into a weighted scoring algorithm that ranks CVEs by priority, with the highest-scoring vulnerabilities selected first for automated rule generation (a minimal scoring sketch follows this list).
  • Figure 5: Candidate Rule Validation Pipeline. Each generated rule passes sequentially through four stages: (1) Synthetic Testing, (2) LLM-as-a-Judge Confidence Scoring, (3) IP Validation against production web traffic, and (4) Human Review. Approved rules proceed to Production Deployment. Failures at any stage are routed to Feedback Integration, which provides stage-specific reasoning back to the Rule Generation Engine for iterative refinement (a schematic sketch of this staged flow also follows the list).
  • ...and 5 more figures
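
As a reading aid for Figure 4, here is a minimal Python sketch of one plausible form of the weighted scoring it describes. The three signals come from the caption; the weight values, the 0-to-1 feature encoding, and all identifiers are assumptions for illustration, not the production algorithm.

```python
from dataclasses import dataclass

# Illustrative weights: the paper specifies a weighted scoring algorithm
# over these three signals but not the weights themselves (assumption).
WEIGHTS = {"keyword_match": 0.4, "cisa_kev": 0.4, "news_presence": 0.2}

@dataclass
class CveSignals:
    cve_id: str
    keyword_match: float  # content-based keyword match strength, 0..1
    cisa_kev: bool        # listed in CISA's Known Exploited Vulnerabilities catalog
    news_presence: bool   # present in monitored cybersecurity news feeds

def priority_score(s: CveSignals) -> float:
    """Weighted sum of the three parallel analysis signals."""
    return (WEIGHTS["keyword_match"] * s.keyword_match
            + WEIGHTS["cisa_kev"] * float(s.cisa_kev)
            + WEIGHTS["news_presence"] * float(s.news_presence))

def select_for_generation(candidates, top_n):
    """Rank CVEs without existing rules; highest scores go first."""
    return sorted(candidates, key=priority_score, reverse=True)[:top_n]

queue = select_for_generation([
    CveSignals("CVE-2025-1111", keyword_match=0.9, cisa_kev=True, news_presence=True),
    CveSignals("CVE-2025-2222", keyword_match=0.3, cisa_kev=False, news_presence=True),
], top_n=2)
print([c.cve_id for c in queue])  # ['CVE-2025-1111', 'CVE-2025-2222']
```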
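
Figure 5's staged flow can be read the same way: a short-circuiting chain in which any failing stage emits feedback for regeneration. The sketch below uses placeholder lambdas for the four stages; the stage names follow the caption, but everything else is schematic rather than the production services.

```python
# Each stage is a placeholder lambda returning (passed, reason); in
# RuleForge these are real services: synthetic tests, an LLM judge,
# IP validation against production traffic, and human review.

def run_pipeline(rule: dict, stages) -> dict:
    """Run the stages in order. The first failure short-circuits and
    returns stage-specific feedback for the Rule Generation Engine."""
    for name, stage in stages:
        passed, reason = stage(rule)
        if not passed:
            return {"approved": False, "failed_stage": name, "feedback": reason}
    return {"approved": True}  # rule proceeds to production deployment

STAGES = [
    ("synthetic_testing", lambda r: (True, "")),
    ("llm_judge", lambda r: (r.get("confidence", 0.0) >= 0.8, "low judge confidence")),
    ("ip_validation", lambda r: (True, "")),
    ("human_review", lambda r: (True, "")),
]

print(run_pipeline({"confidence": 0.6}, STAGES))
# -> {'approved': False, 'failed_stage': 'llm_judge', 'feedback': 'low judge confidence'}
```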