HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing

Lajos Muzsai; David Imolai; András Lukács

TL;DR

HackSynth advances autonomous cybersecurity by integrating Planner and Summarizer LLM modules to perform iterative, self-guided penetration testing within a secured containerized Kali environment. It introduces two standardized benchmarks based on PicoCTF and OverTheWire to evaluate autonomous agents across diverse domains, difficulties, and dynamic flags. Through parameter sweeps and multi-model evaluations, the work shows GPT-4o delivering the strongest performance and highlights safety concerns, containment strategies, and trade-offs between creativity and reliability. The study underscores the potential of LLM-based agents in autonomous pentesting while providing public benchmarks and a roadmap for safer, scalable future research.

Abstract

We introduce HackSynth, a novel Large Language Model (LLM)-based agent capable of autonomous penetration testing. HackSynth's dual-module architecture consists of a Planner and a Summarizer, which enable it to generate commands and process feedback iteratively. To benchmark HackSynth, we propose two new Capture The Flag (CTF)-based benchmark sets built on the popular platforms PicoCTF and OverTheWire. These benchmarks include two hundred challenges across diverse domains and difficulties, providing a standardized framework for evaluating LLM-based penetration testing agents. Using these benchmarks, we present extensive experiments that analyze the core parameters of HackSynth, including creativity (temperature and top-p) and token utilization. Multiple open-source and proprietary LLMs were used to measure the agent's capabilities. The experiments show that the agent performed best with the GPT-4o model, exceeding what GPT-4o's system card suggests. We also discuss the safety and predictability of HackSynth's actions. Our findings indicate the potential of LLM-based agents in advancing autonomous penetration testing and the importance of robust safeguards. HackSynth and the benchmarks are publicly available to foster research on autonomous cybersecurity solutions.
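
The iterative Planner and Summarizer interaction described above can be pictured roughly as the loop below. This is a minimal sketch under stated assumptions, not HackSynth's actual implementation: the function names (`run_agent`, `fake_planner`, `fake_summarizer`, `fake_execute`), the `max_steps` parameter, and the flag pattern are all hypothetical, and the two LLM modules and the sandboxed shell are replaced by simple stand-ins so the example runs without a model or a container.

```python
import re
from typing import Callable, Optional

# Hypothetical flag format, used only for this illustration.
FLAG_PATTERN = re.compile(r"picoCTF\{[^}]*\}")


def run_agent(
    planner: Callable[[str], str],
    summarizer: Callable[[str, str, str], str],
    execute: Callable[[str], str],
    max_steps: int = 5,
) -> Optional[str]:
    """Alternate Planner and Summarizer calls until a flag appears or steps run out."""
    summary = ""  # condensed history of what the agent has learned so far
    for _ in range(max_steps):
        command = planner(summary)   # Planner proposes the next command
        output = execute(command)    # command runs in the (here simulated) sandbox
        summary = summarizer(summary, command, output)  # Summarizer folds in feedback
        match = FLAG_PATTERN.search(output)
        if match:
            return match.group(0)
    return None


# Stand-ins for the two LLM modules and the sandbox; no model or shell is called.
def fake_planner(summary: str) -> str:
    return "cat flag.txt" if "flag.txt" in summary else "ls"


def fake_summarizer(summary: str, command: str, output: str) -> str:
    return f"{summary}\n$ {command}\n{output}".strip()


def fake_execute(command: str) -> str:
    simulated = {"ls": "flag.txt notes.md", "cat flag.txt": "picoCTF{demo}"}
    return simulated.get(command, "")


print(run_agent(fake_planner, fake_summarizer, fake_execute))
```

In a real agent, the planner and summarizer callables would wrap LLM calls and `execute` would run the command inside an isolated container; passing them in as plain callables keeps the control flow easy to test in isolation.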
