ARACNE: An LLM-Based Autonomous Shell Pentesting Agent
Tomas Nieponice, Veronica Valeros, Sebastian Garcia
TL;DR
ARACNE tackles autonomous pentesting with a modular, multi-LLM architecture that separates planning from command execution in SSH shell interactions. A planner, an interpreter, an optional summarizer, and a core agent cooperate to generate, translate, and execute Linux commands on target systems, iterating with the context gathered so far. In experiments, ARACNE reaches a 60% success rate against the autonomous defender ShelLM and a 57.58% success rate on the Over The Wire Bandit challenges, surpassing the prior state of the art on Bandit; successful runs needed fewer than 5 actions on average. The work demonstrates the feasibility of using specialized LLMs for autonomous offensive testing and discusses guardrails, jailbreaks, ethical considerations, and directions for integration with standard security tooling. Overall, ARACNE advances autonomous LLM-driven cybersecurity research by offering a flexible, extensible framework that can adapt as models and defense mechanisms change.
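As a rough illustration of the planner/interpreter separation described in the TL;DR, the loop can be sketched in Python as below. This is a minimal sketch under stated assumptions, not ARACNE's actual implementation: the names `Agent`, `toy_planner`, `toy_interpreter`, and `toy_shell` are hypothetical, the planner and interpreter are plain functions standing in for LLM calls, and the shell is a lookup table standing in for an SSH session.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical type aliases; in a real agent these would wrap LLM calls and SSH.
History = List[Tuple[str, str]]                      # (command, output) pairs
Planner = Callable[[str, History], Optional[str]]    # returns next step, or None when done
Interpreter = Callable[[str], str]                   # natural-language step -> shell command
Summarizer = Callable[[History], History]            # optionally compresses the context
Shell = Callable[[str], str]                         # runs a command, returns its output


@dataclass
class Agent:
    planner: Planner
    interpreter: Interpreter
    shell: Shell
    summarizer: Optional[Summarizer] = None
    max_actions: int = 10                            # assumed action budget
    history: History = field(default_factory=list)

    def run(self, goal: str) -> bool:
        """Plan, translate, and execute until the planner reports completion."""
        for _ in range(self.max_actions):
            # The summarizer is optional: without it the planner sees the full history.
            context = self.summarizer(self.history) if self.summarizer else self.history
            step = self.planner(goal, context)
            if step is None:                         # planner considers the goal reached
                return True
            command = self.interpreter(step)
            output = self.shell(command)
            self.history.append((command, output))
        return False                                 # budget exhausted


# Toy stand-ins so the sketch runs without any model or network access.
def toy_planner(goal: str, history: History) -> Optional[str]:
    steps = ["list files in the home directory", "read the readme file"]
    return steps[len(history)] if len(history) < len(steps) else None


def toy_interpreter(step: str) -> str:
    return {"list files in the home directory": "ls ~",
            "read the readme file": "cat ~/readme"}[step]


def toy_shell(command: str) -> str:
    return {"ls ~": "readme", "cat ~/readme": "s3cret"}[command]


agent = Agent(toy_planner, toy_interpreter, toy_shell)
solved = agent.run("find the password")
```

Keeping the planner and interpreter behind separate callables means each role could be backed by a different model, which is the property the multi-LLM design relies on.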
Abstract
We introduce ARACNE, a fully autonomous LLM-based pentesting agent tailored for SSH services that can execute commands on real Linux shell systems. It introduces a new agent architecture with multi-LLM support. Experiments show that ARACNE can reach a 60% success rate against the autonomous defender ShelLM and a 57.58% success rate against the Over The Wire Bandit CTF challenges, improving over the state of the art. On successful runs, the agent needed fewer than 5 actions on average to accomplish its goals. The results show that using multiple LLMs is a promising approach to increasing the accuracy of the agent's actions.
