HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing

Lajos Muzsai, David Imolai, András Lukács

TL;DR

HackSynth advances autonomous cybersecurity by integrating Planner and Summarizer LLM modules to perform iterative, self-guided penetration testing within a secured containerized Kali environment. It introduces two standardized benchmarks based on PicoCTF and OverTheWire to evaluate autonomous agents across diverse domains, difficulties, and dynamic flags. Through parameter sweeps and multi-model evaluations, the work shows GPT-4o delivering the strongest performance and highlights safety concerns, containment strategies, and trade-offs between creativity and reliability. The study underscores the potential of LLM-based agents in autonomous pentesting while providing public benchmarks and a roadmap for safer, scalable future research.

Abstract

We introduce HackSynth, a novel Large Language Model (LLM)-based agent capable of autonomous penetration testing. HackSynth's dual-module architecture consists of a Planner and a Summarizer, which enable it to generate commands and process feedback iteratively. To benchmark HackSynth, we propose two new Capture The Flag (CTF)-based benchmark sets built on the popular platforms PicoCTF and OverTheWire. These benchmarks include two hundred challenges across diverse domains and difficulties, providing a standardized framework for evaluating LLM-based penetration testing agents. Using these benchmarks, we present extensive experiments analyzing the core parameters of HackSynth, including creativity (temperature and top-p) and token utilization. Multiple open-source and proprietary LLMs were used to measure the agent's capabilities. The experiments show that the agent performed best with the GPT-4o model, exceeding what GPT-4o's system card suggests. We also discuss the safety and predictability of HackSynth's actions. Our findings indicate the potential of LLM-based agents in advancing autonomous penetration testing and the importance of robust safeguards. HackSynth and the benchmarks are publicly available to foster research on autonomous cybersecurity solutions.
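The iterative Planner and Summarizer loop described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function names (`run_agent`, `toy_planner`, `toy_summarizer`, `toy_execute`), the character-based observation window, the step limit, and the flag pattern are all assumptions made for illustration, and the LLM calls are replaced by toy stand-ins so the example runs offline.

```python
import re
from typing import Callable, Optional

# Assumed flag format, used here only to detect success in the toy example.
FLAG_PATTERN = re.compile(r"picoCTF\{[^}]*\}")


def run_agent(
    planner: Callable[[str], str],
    summarizer: Callable[[str, str, str], str],
    execute: Callable[[str], str],
    max_steps: int = 10,
    observation_window: int = 250,
) -> Optional[str]:
    """Alternate planning and summarizing until a flag appears or steps run out."""
    summary = ""
    for _ in range(max_steps):
        # Planner: propose the next command from the summary of past actions.
        command = planner(summary)
        # Keep only the first `observation_window` characters of the output
        # (hypothetical truncation rule).
        output = execute(command)[:observation_window]
        match = FLAG_PATTERN.search(output)
        if match:
            return match.group(0)
        # Summarizer: fold the new command and its output into the history.
        summary = summarizer(summary, command, output)
    return None


# Toy stand-ins for the two LLM modules and the sandboxed shell.
def toy_planner(summary: str) -> str:
    return "cat flag.txt" if "flag.txt" in summary else "ls"


def toy_summarizer(summary: str, command: str, output: str) -> str:
    return f"{summary}\n$ {command}\n{output}".strip()


def toy_execute(command: str) -> str:
    fake_shell = {"ls": "flag.txt notes.md", "cat flag.txt": "picoCTF{example_flag}"}
    return fake_shell.get(command, "")


result = run_agent(toy_planner, toy_summarizer, toy_execute)
print(result)
```

In a real deployment the `execute` callable would run commands inside the isolated container rather than a dictionary lookup, and the two module callables would wrap LLM requests.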

Paper Structure

This paper contains 25 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: High level overview of the architecture of HackSynth.
  • Figure 2: Distribution of benchmark challenges across categories and difficulty levels for (a) PicoCTF and (b) OverTheWire. These bar charts display the number of challenges in each category—General Skills, Web Exploitation, Cryptography, Binary Exploitation, Forensics, and Reverse Engineering—classified by difficulty (Easy, Medium, Hard).
  • Figure 3: Effect of new observation window size on completed challenges. The plot shows the number of completed challenges as the observation window size increases, highlighting an optimal range for efficient summarization and processing.
  • Figure 4: Impact of the Temperature Parameter on Challenge Completion by HackSynth. This plot illustrates the relationship between the temperature parameter (ranging from 0.0 to 2.0) and the number of completed challenges. The trend indicates stable performance at lower temperatures, with a marked decline in successful completions as temperature increases, reflecting the decreased coherence and increased randomness in generated outputs at higher values.
  • Figure 5: Impact of the Temperature Parameter on Error Rate in Command Generation. This plot depicts the probability of commands resulting in errors as the temperature parameter is adjusted. Error rates remain stable between temperatures 0.0 and 1.0, but increase proportionally with higher temperatures, highlighting the trade-off between variability and reliability in command execution at elevated temperature settings.
  • ...and 4 more figures
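
The temperature and top-p parameters referenced in the abstract and in Figures 4 and 5 control how sharply the model's next-token distribution is peaked. The sketch below shows the standard temperature-scaled softmax and nucleus (top-p) filtering on a toy list of logits; it is a generic illustration of these sampling parameters, not code from HackSynth, and the function names and example values are assumptions. Temperature 0.0 is usually handled as greedy decoding by inference libraries, so this sketch rejects it rather than dividing by zero.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities; higher temperature flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: List[float], top_p: float) -> List[float]:
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set()
    cumulative = 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(toy_logits, 0.5)
hot = softmax_with_temperature(toy_logits, 2.0)
nucleus = top_p_filter([0.5, 0.3, 0.2], 0.7)
print(cold, hot, nucleus)
```

At the lower temperature the most likely token takes a larger share of the probability mass, while at the higher temperature unlikely tokens are sampled more often, which is consistent with the increased randomness the figure captions describe at elevated temperatures.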