I Can't Believe It's Not a Valid Exploit

Derin Gezgin, Amartya Das, Shinhae Kim, Zhengdong Huang, Nevena Stojkovic, Claire Wang

TL;DR

The paper addresses the problem of reliably generating PoC exploits for Java vulnerabilities with LLMs by introducing PoC-Gym, a three-stage framework that combines static taint-trace guidance with dynamic, execution-based validation. Evaluated on 20 real CVEs, PoC-Gym shows that automated success signals substantially overestimate actual exploitation: post-hoc validation reveals many false positives. Static trace guidance reduces false positives but does not eliminate them, and even with improvements over prior work such as FaultLine, a large fraction of PoCs remain invalid when checked against ground-truth vulnerability locations. The work highlights the need for stronger execution-level guarantees and post-hoc validation so that LLM-assisted PoC generation reflects genuine exploitation rather than surface-level indicators.

Abstract

Recently, large language models (LLMs) have been used in security vulnerability detection tasks, including generating proof-of-concept (PoC) exploits. A PoC exploit is a program that demonstrates how a vulnerability can be exploited. Several approaches suggest that supporting LLMs with additional guidance can improve PoC generation outcomes, motivating further evaluation of their effectiveness. In this work, we develop PoC-Gym, a framework for LLM-based PoC generation for Java security vulnerabilities and systematic validation of the generated exploits. Using PoC-Gym, we evaluate whether guidance from static analysis tools improves the PoC generation success rate, and we manually inspect the resulting PoCs. Our results from running PoC-Gym with Claude Sonnet 4, GPT-5 Medium, and gpt-oss-20b show that using static analysis for guidance and criteria leads to 21% higher success rates than the prior baseline, FaultLine. However, manual inspection of both successful and failed PoCs reveals that 71.5% of the PoCs are invalid. These results show that the reported success of LLM-based PoC generation can be significantly misleading, and that this is hard to detect with current validation mechanisms.

Paper Structure (34 sections, 3 equations, 3 figures, 9 tables)


Figures (3)

  • Figure 1: Overview of the PoC-Gym pipeline, which consists of three main stages: prompt construction, PoC generation, and PoC validation with feedback.
  • Figure 2: Distribution of the manual analysis results for the multi-trace runs. The plain run results are given in the appendix.
  • Figure 3: Detailed results of the manual analysis pipeline for the no-trace experiment results.
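
The three stages named in the Figure 1 caption (prompt construction, PoC generation, and PoC validation with feedback) can be read as a retry loop. The following is a minimal Python sketch of that reading, not the authors' implementation: every name here (`build_prompt`, `run_pipeline`, `ValidationResult`, the `generate` and `validate` callables, and the `max_attempts` limit) is hypothetical, and the LLM call and the execution harness are left as caller-supplied functions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class ValidationResult:
    """Outcome of one automated validation run (hypothetical structure)."""
    success: bool   # automated success signal only, not proof of a valid exploit
    feedback: str   # text passed into the next prompt when the attempt fails


def build_prompt(cve_description: str,
                 taint_trace: Optional[str],
                 feedback: Optional[str]) -> str:
    """Stage 1: assemble the prompt, optionally adding a static taint trace
    and the feedback from the previous failed attempt."""
    parts = [f"Vulnerability: {cve_description}"]
    if taint_trace:
        parts.append(f"Static taint trace: {taint_trace}")
    if feedback:
        parts.append(f"Previous attempt failed: {feedback}")
    return "\n".join(parts)


def run_pipeline(cve_description: str,
                 taint_trace: Optional[str],
                 generate: Callable[[str], str],
                 validate: Callable[[str], ValidationResult],
                 max_attempts: int = 3) -> Tuple[Optional[str], int]:
    """Stages 2 and 3: generate a PoC, validate it, and retry with feedback.

    Returns the first PoC that passes automated validation together with the
    number of attempts used, or (None, max_attempts) if none passes.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        prompt = build_prompt(cve_description, taint_trace, feedback)
        poc = generate(prompt)       # assumed: wraps an LLM call
        result = validate(poc)       # assumed: compiles and executes the PoC
        if result.success:
            return poc, attempt
        feedback = result.feedback
    return None, max_attempts
```

In this sketch, a PoC returned by `run_pipeline` has only passed the automated check. The abstract's point is that such a signal can be misleading, so a separate post-hoc inspection step would still be needed before counting the PoC as valid.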