
LLMs as Hackers: Autonomous Linux Privilege Escalation Attacks

Andreas Happe, Aaron Kaplan, Juergen Cito

TL;DR

The paper addresses how capable LLMs are at autonomously performing Linux privilege-escalation and introduces hackingBuddyGPT along with a reproducible, open benchmark. By evaluating multiple models (GPT-3.5-Turbo, GPT-4-Turbo, Llama3 variants) against human pen-testers and traditional tools, and by exploring context-management and high-level guidance, the study reveals that GPT-4-Turbo achieves human-competitive success (up to ~83% with guidance) while smaller models lag significantly. It demonstrates that state compression and high-level guidance substantially boost performance and that context-size trade-offs and cost considerations critically shape practical deployment. The work provides a valuable, open-source platform for evaluating LLM-driven offensive security, offers nuanced insights into command quality and model behavior, and highlights ethical safeguards and threats to validity, informing both defense and responsible research in LLM-assisted penetration testing.

Abstract

Penetration-testing is crucial for identifying system vulnerabilities, with privilege-escalation being a critical subtask to gain elevated access to protected resources. Language Models (LLMs) present new avenues for automating these security practices by emulating human behavior. However, a comprehensive understanding of LLMs' efficacy and limitations in performing autonomous Linux privilege-escalation attacks remains under-explored. To address this gap, we introduce hackingBuddyGPT, a fully automated LLM-driven prototype designed for autonomous Linux privilege-escalation. We curated a novel, publicly available Linux privilege-escalation benchmark, enabling controlled and reproducible evaluation. Our empirical analysis assesses the quantitative success rates and qualitative operational behaviors of various LLMs -- GPT-3.5-Turbo, GPT-4-Turbo, and Llama3 -- against baselines of human professional pen-testers and traditional automated tools. We investigate the impact of context management strategies, different context sizes, and various high-level guidance mechanisms on LLM performance. Results show that GPT-4-Turbo demonstrates high efficacy, successfully exploiting 33-83% of vulnerabilities, a performance comparable to human pen-testers (75%). In contrast, local models like Llama3 exhibited limited success (0-33%), and GPT-3.5-Turbo achieved moderate rates (16-50%). We show that both high-level guidance and state-management through LLM-driven reflection significantly boost LLM success rates. Qualitative analysis reveals both LLMs' strengths and weaknesses in generating valid commands and highlights challenges in common-sense reasoning, error handling, and multi-step exploitation, particularly with temporal dependencies. Cost analysis indicates that GPT-4-Turbo can achieve human-comparable performance at competitive costs, especially with optimized context management.

Paper Structure (56 sections, 7 figures, 10 tables)

Figures (7)

  • Figure 1: Typical benchmark flow controlled by our provided benchmark script. It utilizes standard UNIX tools such as Vagrant and Ansible to provision and configure the vulnerable virtual machines used within our benchmark. hackingBuddyGPT is invoked for every started virtual machine and performs autonomous privilege-escalation attacks. Results are reported back to the benchmark script which subsequently destroys the used testing-machines.
  • Figure 2: High-level overview of the testbed and prototype (hackingBuddyGPT). The testbed on the left contains multiple vulnerable Linux virtual machines, each of which contains a single vulnerability. hackingBuddyGPT is connected to the testbed over SSH and its execution is triggered by a human penetration-tester. Its Main Module uses the next-command LLM prompt to determine the next command, which it then executes on the connected vulnerable virtual machine (in the testbed). It captures the result and forwards both the command and its result to the State Management Subsystem, which incorporates multiple configurable state management strategies. The diagram also includes two optional guidance mechanisms: the Enumeration Module initially executes a traditional Linux enumeration tool (lse.sh) against the vulnerable machine and derives multiple guidance hints through an LLM call; High-Level Hints consist of a single hint per vulnerable VM. Depending on the configuration, this guidance is included in the next-command prompt and influences the generation of the commands to be executed.
  • Figure 3: The prompt (next-command) used to query an LLM for the next command to execute. It contains multiple variables which will be replaced during execution. The only mandatory variable is capabilities; all others are optional. capabilities contains a list of available capabilities (default: execute-command and test-credentials), history contains the truncated history of executed commands, state is filled in with a summarized state, and guidance can contain either high-level or enumeration-derived hints.
  • Figure 4: The prompt (update-state) used to update the state/facts. Multiple variables are filled in during execution: facts contains the current state, typically stated as a list of facts; cmd contains the last shell command executed on the virtual machine; and resp contains the corresponding command output.
  • Figure 5: Graph of accumulated context token usage over time (indicated by the rounds) for different LLMs. Colors indicate different test-cases and are identical in both graphs (also see Figure \ref{context_tokens}).
  • ...and 2 more figures
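
The control loop described in the captions for Figures 2 to 4 can be illustrated with a minimal sketch. It is not hackingBuddyGPT's actual code: the prompt wording, the `Session` class, the `render_next_command` and `run` functions, the round limit, and the check for `uid=0(root)` as a success signal are all assumptions made for illustration. Only the variable names (capabilities, history, state, guidance, facts, cmd, resp) and the two default capabilities come from the captions. The LLM and the SSH connection are replaced by stub functions so the example runs offline.

```python
from dataclasses import dataclass, field
from string import Template
from typing import Callable, List, Optional, Tuple

# Hypothetical prompt wording; only the variable names follow the captions.
NEXT_COMMAND = Template(
    "You can use these capabilities: ${capabilities}\n"
    "${history}${state}${guidance}"
    "Give the next command to execute."
)
UPDATE_STATE = Template(
    "Known facts:\n${facts}\n"
    "Executed command: ${cmd}\n"
    "Output: ${resp}\n"
    "Return the updated list of facts."
)


@dataclass
class Session:
    capabilities: List[str] = field(
        default_factory=lambda: ["execute-command", "test-credentials"]
    )
    history: List[Tuple[str, str]] = field(default_factory=list)
    facts: str = ""      # summarized state, rewritten by the update-state prompt
    guidance: str = ""   # optional high-level or enumeration-derived hint


def render_next_command(s: Session, max_history: int = 5) -> str:
    """Fill the next-command prompt; empty optional variables are left out."""
    recent = s.history[-max_history:]  # crude stand-in for history truncation
    history = "".join(f"$ {cmd}\n{out}\n" for cmd, out in recent)
    if history:
        history = "Previously executed commands:\n" + history
    state = f"Current facts:\n{s.facts}\n" if s.facts else ""
    guidance = f"Hint: {s.guidance}\n" if s.guidance else ""
    return NEXT_COMMAND.substitute(
        capabilities=", ".join(s.capabilities),
        history=history, state=state, guidance=guidance,
    )


def run(s: Session, llm: Callable[[str], str], execute: Callable[[str], str],
        max_rounds: int = 10) -> Optional[int]:
    """Return the round in which root was reached, or None if it never was."""
    for round_no in range(1, max_rounds + 1):
        cmd = llm(render_next_command(s)).strip()
        resp = execute(cmd)
        s.history.append((cmd, resp))
        # State management via reflection: ask the LLM to rewrite the fact list.
        s.facts = llm(UPDATE_STATE.substitute(facts=s.facts, cmd=cmd, resp=resp))
        if "uid=0(root)" in resp:  # assumed success signal, for illustration only
            return round_no
    return None


def fake_llm(prompt: str) -> str:
    """Stub LLM: answers the two prompt types with canned text."""
    if prompt.startswith("Known facts"):
        return "- sudo allows running any command without a password"
    return "sudo id" if "sudo -l" in prompt else "sudo -l"


def fake_execute(cmd: str) -> str:
    """Stub for command execution over SSH on the vulnerable VM."""
    outputs = {
        "sudo -l": "(ALL) NOPASSWD: ALL",
        "sudo id": "uid=0(root) gid=0(root) groups=0(root)",
    }
    return outputs.get(cmd, "command not found")


if __name__ == "__main__":
    print(run(Session(), fake_llm, fake_execute))
```

In this sketch the history and the summarized state are both included whenever they are non-empty. The captions describe them as optional variables, so a configuration could drop either one to trade context size against information, which is the kind of context-management choice the abstract evaluates.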