A Systematic Study on Generating Web Vulnerability Proof-of-Concepts Using Large Language Models

Mengyao Zhao, Kaixuan Li, Lyuye Zhang, Wenjing Dang, Chenggong Ding, Sen Chen, Zheli Liu

TL;DR

This study systematically evaluates whether large language models can automatically generate functional web vulnerability PoCs using only publicly disclosed information across three stages of vulnerability disclosure. By benchmarking GPT-4o and DeepSeek-R1 on 100 reproducible CVEs with a four-phase methodology (baseline effectiveness, failure analysis, context augmentation, and adaptive prompting), the authors demonstrate that working PoCs can be produced in 8–34% of cases using only public data, with success rates climbing to 68–72% when adaptive reasoning and richer code context are added. The work shows that context completeness and iterative feedback are critical to performance, and that DeepSeek-R1 generally outperforms GPT-4o, especially when function-level context and CoT/ICL strategies are applied. Notably, 23 PoCs generated by the models were accepted by NVD and Exploit DB, underscoring practical relevance while also highlighting significant dual-use risks that warrant careful disclosure policies. The paper contributes a reproducible vulnerability benchmark, tailored prompts that combine CoT, ICL, and real-time validators, and open resources to spur further research on LLM-assisted PoC generation.

Abstract

Recent advances in Large Language Models (LLMs) have brought remarkable progress in code understanding and reasoning, creating new opportunities and raising new concerns for software security. Among many downstream tasks, generating Proof-of-Concept (PoC) exploits plays a central role in vulnerability reproduction, comprehension, and mitigation. While previous research has focused primarily on zero-day exploitation, the growing availability of rich public information accompanying disclosed CVEs leads to a natural question: can LLMs effectively use this information to automatically generate valid PoCs? In this paper, we present the first empirical study of LLM-based PoC generation for web application vulnerabilities, focusing on the practical feasibility of leveraging publicly disclosed information. We evaluate GPT-4o and DeepSeek-R1 on 100 real-world and reproducible CVEs across three stages of vulnerability disclosure: (1) newly disclosed vulnerabilities with only descriptions, (2) 1-day vulnerabilities with patches, and (3) N-day vulnerabilities with full contextual code. Our results show that LLMs can automatically generate working PoCs in 8%-34% of cases using only public data, with DeepSeek-R1 consistently outperforming GPT-4o. Further analysis shows that supplementing code context improves success rates by 17%-20%, with function-level context providing a 9%-13% improvement over file-level context. Integrating adaptive reasoning strategies into prompt refinement further raises success rates to 68%-72%. Our findings suggest that LLMs could reshape vulnerability exploitation dynamics. To date, 23 newly generated PoCs have been accepted by NVD and Exploit DB.

Paper Structure

This paper contains 36 sections, 2 equations, 19 figures, 2 tables.

Figures (19)

  • Figure 1: From disclosure to exploit: information availability and PoC impact.
  • Figure 2: Overview of our study.
  • Figure 3: Benchmark of our study.
  • Figure 4: Basic prompt template for PoC generation.
  • Figure 5: Distribution of taint-style vulnerability attack payloads generated by GPT-4o.
  • ...and 14 more figures