RapidPen: Fully Automated IP-to-Shell Penetration Testing with LLM-based Agents

Sho Nakatani

TL;DR

RapidPen automates IP-to-Shell penetration testing by combining ReAct-style task planning with retrieval-augmented knowledge of exploits. The system uses a dual-layer data model (PTT) and two dedicated RAG repositories to plan and execute multi-step attacks, with iterative feedback in the Act module, so that exploitation can start from a single IP address and proceed without human input. In preliminary experiments on the Hack The Box (HTB) target Legacy, RapidPen achieved a 60% success rate when leveraging prior success cases, typically delivering a shell within 200–400 seconds at a cost of about $0.3–$0.6 per run. The work shows how automated pentesting could become accessible to non-experts while offloading repetitive tasks from professionals, and it outlines directions for expanding scope, improving robustness, and approaching real-world deployment.

Abstract

We present RapidPen, a fully automated penetration testing (pentesting) framework that addresses the challenge of achieving an initial foothold (IP-to-Shell) without human intervention. Unlike prior approaches that focus primarily on post-exploitation or require a human-in-the-loop, RapidPen leverages large language models (LLMs) to autonomously discover and exploit vulnerabilities, starting from a single IP address. By integrating advanced ReAct-style task planning (Re) with retrieval-augmented knowledge bases of successful exploits, along with a command-generation and direct execution feedback loop (Act), RapidPen systematically scans services, identifies viable attack vectors, and executes targeted exploits in a fully automated manner. In our evaluation against a vulnerable target from the Hack The Box platform, RapidPen achieved shell access within 200–400 seconds at a per-run cost of approximately $0.3–$0.6, demonstrating a 60% success rate when reusing prior "success-case" data. These results underscore the potential of truly autonomous pentesting for both security novices and seasoned professionals. Organizations without dedicated security teams can leverage RapidPen to quickly identify critical vulnerabilities, while expert pentesters can offload repetitive tasks and focus on complex challenges. Ultimately, our work aims to make penetration testing more accessible and cost-efficient, thereby enhancing the overall security posture of modern software ecosystems.

Paper Structure

This paper contains 61 sections, 8 figures.

Figures (8)

  • Figure 1: The high-level architecture of RapidPen, illustrating user inputs and outputs, the Re and Act modules, and RapidPen-vis, a visualization tool for monitoring intermediate processes and final reports.
  • Figure 2: The Re module in RapidPen consists of the Re (L1) PTT Planner and Re (L1) PTT Prioritizer submodules. The PTT Planner is responsible for expanding and maintaining the PTT tree, while the PTT Prioritizer determines the next task to execute.
  • Figure 3: The Re (L1) PTT Planner processes the command results from the last executed task to generate new tasks at Level 2 (L2) in the PTT. These tasks are deduplicated using an LLM-based approach before being merged into the PTT, updating it from an old to a new state.
  • Figure 4: The Re (L2) New Task Generation module generates new tasks based on historical success cases. The process begins by extracting command results from the last executed task. An LLM queries relevant historical success cases, which are then analyzed to extract key insights. Based on this analysis, new tasks are generated and integrated into the planning process.
  • Figure 5: The Act (L1) module processes runnable tasks through three key stages: command generation, execution, and log analysis. It leverages offensive security expertise to generate commands, automates execution, and applies self-correcting mechanisms when necessary. Successful executions produce a command result, while failed executions trigger a feedback loop for improvement.
  • ...and 3 more figures
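
The control flow described in the captions for Figures 2, 3, and 5 can be sketched roughly as follows. This is a minimal illustration, not RapidPen's implementation: the names `Task`, `walk`, `prioritize`, `plan`, `act`, and `run` are hypothetical, the LLM calls are replaced by plain callables, deduplication uses case-insensitive name matching in place of the LLM-based approach the paper describes, and the demo stubs only build strings and never execute a command.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional, Tuple


@dataclass
class Task:
    """One node of a hypothetical pentesting task tree (PTT)."""
    name: str
    status: str = "todo"  # "todo", "done", or "failed"
    children: List["Task"] = field(default_factory=list)


def walk(root: Task) -> Iterator[Task]:
    """Yield every task in the tree, depth-first."""
    yield root
    for child in root.children:
        yield from walk(child)


def prioritize(root: Task) -> Optional[Task]:
    """Stand-in for the PTT Prioritizer: return the first task still to do."""
    return next((t for t in walk(root) if t.status == "todo"), None)


def plan(root: Task, parent: Task, proposed: List[str]) -> None:
    """Stand-in for the PTT Planner: merge new tasks under `parent`.

    Assumption: duplicates are detected by lowercase name, not by an LLM.
    """
    existing = {t.name.lower() for t in walk(root)}
    for name in proposed:
        if name.lower() not in existing:
            parent.children.append(Task(name))
            existing.add(name.lower())


def act(task: Task,
        generate: Callable[[Task, str], str],
        execute: Callable[[str], str],
        analyze: Callable[[str], Tuple[bool, str]],
        max_attempts: int = 3) -> Optional[str]:
    """Generate a command, run it, analyze the log; retry with feedback."""
    feedback = ""
    for _ in range(max_attempts):
        log = execute(generate(task, feedback))
        ok, feedback = analyze(log)
        if ok:
            return log  # the "command result" handed back to the planner
    return None


def run(root, propose, generate, execute, analyze, max_steps: int = 20) -> Task:
    """Alternate between picking a task (Re) and executing it (Act)."""
    for _ in range(max_steps):
        task = prioritize(root)
        if task is None:
            break
        result = act(task, generate, execute, analyze)
        task.status = "done" if result is not None else "failed"
        if result is not None:
            plan(root, task, propose(task, result))
    return root


# Harmless demo stubs: nothing here touches a network or a shell.
def fake_generate(task: Task, feedback: str) -> str:
    return f"echo {task.name}" + (" (retry)" if feedback else "")


def fake_execute(command: str) -> str:
    return f"ran: {command}"


def fake_analyze(log: str) -> Tuple[bool, str]:
    return True, ""


def fake_propose(task: Task, result: str) -> List[str]:
    if task.name == "Scan target":
        return ["Enumerate services", "enumerate services"]  # second is a duplicate
    return []


tree = run(Task("Scan target"), fake_propose, fake_generate, fake_execute, fake_analyze)
for node in walk(tree):
    print(node.name, node.status)
```

Passing the planner, command generator, executor, and log analyzer in as callables keeps the loop itself testable without an LLM or a live target. In a real system each of those callables would presumably wrap a model call or a sandboxed executor.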