LLMs as Hackers: Autonomous Linux Privilege Escalation Attacks
Andreas Happe, Aaron Kaplan, Juergen Cito
TL;DR
The paper examines how well LLMs can autonomously perform Linux privilege escalation, and introduces hackingBuddyGPT together with a reproducible, open benchmark. It evaluates multiple models (GPT-3.5-Turbo, GPT-4-Turbo, Llama3 variants) against human pen-testers and traditional tools, and explores context management and high-level guidance. GPT-4-Turbo achieves human-competitive success (up to ~83% with guidance), while smaller models lag significantly. State compression and high-level guidance substantially boost performance, and context-size trade-offs and cost shape practical deployment. The work provides an open-source platform for evaluating LLM-driven offensive security, analyzes command quality and model behavior, and discusses ethical safeguards and threats to validity, informing both defense and responsible research in LLM-assisted penetration testing.
Abstract
Penetration-testing is crucial for identifying system vulnerabilities, with privilege-escalation being a critical subtask to gain elevated access to protected resources. Large Language Models (LLMs) present new avenues for automating these security practices by emulating human behavior. However, the efficacy and limitations of LLMs in performing autonomous Linux privilege-escalation attacks remain under-explored. To address this gap, we introduce hackingBuddyGPT, a fully automated LLM-driven prototype designed for autonomous Linux privilege-escalation. We curated a novel, publicly available Linux privilege-escalation benchmark, enabling controlled and reproducible evaluation. Our empirical analysis assesses the quantitative success rates and qualitative operational behaviors of various LLMs (GPT-3.5-Turbo, GPT-4-Turbo, and Llama3) against baselines of human professional pen-testers and traditional automated tools. We investigate the impact of context management strategies, different context sizes, and various high-level guidance mechanisms on LLM performance. Results show that GPT-4-Turbo demonstrates high efficacy, successfully exploiting 33-83% of vulnerabilities, a performance comparable to human pen-testers (75%). In contrast, local models like Llama3 exhibited limited success (0-33%), and GPT-3.5-Turbo achieved moderate rates (16-50%). We show that both high-level guidance and state-management through LLM-driven reflection significantly boost LLM success rates. Qualitative analysis reveals the LLMs' strengths and weaknesses in generating valid commands and highlights challenges in common-sense reasoning, error handling, and multi-step exploitation, particularly with temporal dependencies. Cost analysis indicates that GPT-4-Turbo can achieve human-comparable performance at competitive costs, especially with optimized context management.
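
The two context-management ideas mentioned in the abstract, a bounded command history and a compact state updated through LLM-driven reflection, can be illustrated with a minimal Python sketch. This is not hackingBuddyGPT's actual implementation: the names History, approx_tokens, update_state, and summarize are hypothetical, the token count is a crude whitespace approximation, and the model call is a plain callable supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer (assumption): count whitespace-separated words."""
    return len(text.split())


@dataclass
class History:
    """Hypothetical rolling log of (command, output) pairs with a token budget."""
    budget: int
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, command: str, output: str) -> None:
        self.entries.append((command, output))

    def render(self) -> str:
        """Return the newest entries that fit in the budget, oldest first."""
        kept: List[str] = []
        used = 0
        for command, output in reversed(self.entries):
            chunk = f"$ {command}\n{output}"
            cost = approx_tokens(chunk)
            if used + cost > self.budget:
                break  # older entries are dropped once the budget is exhausted
            kept.append(chunk)
            used += cost
        return "\n".join(reversed(kept))


def update_state(state: str, command: str, output: str,
                 summarize: Callable[[str], str]) -> str:
    """Ask a model (the hypothetical `summarize` callable) to rewrite a compact
    fact list after each command, instead of carrying the full history."""
    prompt = (
        f"Known facts:\n{state}\n\n"
        f"Latest command:\n$ {command}\n{output}\n\n"
        "Rewrite the list of known facts."
    )
    return summarize(prompt)
```

The design trade-off in this sketch is that a rendered history grows with every command until it is truncated, whereas a compressed state stays small at the price of one extra model call per step. The study's actual strategies and measurements are reported in the paper itself.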
