Automated Vulnerability Validation and Verification: A Large Language Model Approach

Alireza Lotfi, Charalampos Katsis, Elisa Bertino

TL;DR

The paper presents an end-to-end pipeline that uses large language models augmented with retrieval-based context to automate the orchestration, reproduction, and validation of known software vulnerabilities within containerized environments. By extracting CVE details, enriching them via RAG, and generating executable exploit artifacts, the approach enables systematic vulnerability analysis across diverse languages and CWEs. Across 102 CVEs, the pipeline reproduced 71 and failed on 31; the experiments also reveal inconsistencies in CVE descriptions and show that an available PoC reduces the effort needed for reproduction. The work emphasizes reproducibility, model-agnostic deployment, and open-sourcing artifacts to accelerate security research, while outlining future directions for multi-step attacks and domain-specific products.

Abstract

Software vulnerabilities remain a critical security challenge, providing entry points for attackers into enterprise networks. Despite advances in security practices, the lack of high-quality datasets capturing diverse exploit behavior limits effective vulnerability assessment and mitigation. This paper introduces an end-to-end multi-step pipeline leveraging generative AI, specifically large language models (LLMs), to address the challenges of orchestrating and reproducing attacks against known software vulnerabilities. Our approach extracts information from CVE disclosures in the National Vulnerability Database, augments it with external public knowledge (e.g., threat advisories, code snippets) using Retrieval-Augmented Generation (RAG), and automates the creation of containerized environments and exploit code for each vulnerability. The pipeline iteratively refines generated artifacts, validates attack success with test cases, and supports complex multi-container setups. Our methodology overcomes key obstacles, including noisy and incomplete vulnerability descriptions, by integrating LLMs and RAG to fill information gaps. We demonstrate the effectiveness of our pipeline across different vulnerability types, such as memory overflows, denial of service, and remote code execution, spanning diverse programming languages, libraries, and years. In doing so, we uncover significant inconsistencies in CVE descriptions, emphasizing the need for more rigorous verification in the CVE disclosure process. Our approach is model-agnostic, working across multiple LLMs, and we open-source the artifacts to enable reproducibility and accelerate security research. To the best of our knowledge, this is the first system to systematically orchestrate and exploit known vulnerabilities in containerized environments by combining general-purpose LLM reasoning with CVE data and RAG-based context enrichment.
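
The generate, validate, and refine loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `reproduce`, `Artifacts`, `Result`, the three callables, and the iteration cap are all hypothetical, and the stubs stand in for the real retrieval, LLM, and container steps, which the abstract does not specify in detail.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Artifacts:
    """Generated outputs for one attempt (hypothetical structure)."""
    dockerfile: str
    exploit: str


@dataclass
class Result:
    cve_id: str
    reproduced: bool
    iterations: int
    artifacts: Optional[Artifacts]


def reproduce(
    cve_id: str,
    description: str,
    retrieve: Callable[[str, str], str],
    generate: Callable[[str, str, str], Artifacts],
    validate: Callable[[Artifacts], Tuple[bool, str]],
    max_iterations: int = 5,  # assumed cap; the abstract gives no number
) -> Result:
    """Enrich the CVE description, then generate and validate until success."""
    context = retrieve(cve_id, description)  # RAG-style enrichment step
    feedback = ""
    artifacts = None
    for iteration in range(1, max_iterations + 1):
        # Each attempt sees the validator's feedback from the previous one.
        artifacts = generate(description, context, feedback)
        ok, feedback = validate(artifacts)
        if ok:
            return Result(cve_id, True, iteration, artifacts)
    return Result(cve_id, False, max_iterations, artifacts)


# Stubs standing in for retrieval, the LLM, and the containerized test run.
def stub_retrieve(cve_id: str, description: str) -> str:
    return "advisory text and code snippets for " + cve_id


def stub_generate(description: str, context: str, feedback: str) -> Artifacts:
    # Pretend the model only fixes the exploit after seeing feedback.
    exploit = "exploit-v2" if feedback else "exploit-v1"
    return Artifacts(dockerfile="FROM placeholder-image", exploit=exploit)


def stub_validate(artifacts: Artifacts) -> Tuple[bool, str]:
    if artifacts.exploit == "exploit-v2":
        return True, ""
    return False, "test case failed: target did not crash"


# "CVE-0000-00000" is a placeholder identifier, not a real CVE.
result = reproduce(
    "CVE-0000-00000", "placeholder description",
    stub_retrieve, stub_generate, stub_validate,
)
print(result.reproduced, result.iterations)
```

Passing the three stages as callables keeps the loop independent of any particular model or container runtime, which loosely mirrors the model-agnostic claim in the abstract.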

Paper Structure

This paper contains 23 sections, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Pipeline Overview
  • Figure 2: Experiment Breakdown
  • Figure 3: Stepwise Progression of LLMs Across the Pipeline
  • Figure 4: Influence of PoC on Iteration Average