An Empirical Evaluation of LLMs for Solving Offensive Security Challenges

Minghao Shao; Boyuan Chen; Sofija Jancheska; Brendan Dolan-Gavitt; Siddharth Garg; Ramesh Karri; Muhammad Shafique

TL;DR

The study evaluates how well large language models (LLMs) can solve real Capture The Flag (CTF) challenges using human-in-the-loop (HITL) and fully automated workflows, and compares the results with those of human teams. It benchmarks six LLMs across 26 challenges, analyzes their strengths and failure modes, and reports that GPT-4 in particular can approach or surpass average human performance in automated settings. The findings point to substantial potential for LLM-enabled cybersecurity education and automated problem-solving, while underscoring the continued need for human oversight and for better prompts and tooling. Overall, the work provides a practical framework and data for systematically assessing the offensive cybersecurity capabilities of LLMs.

Abstract

Capture The Flag (CTF) challenges are puzzles related to computer security scenarios. With the advent of large language models (LLMs), more and more CTF participants are using LLMs to understand and solve the challenges. However, so far no work has evaluated the effectiveness of LLMs in solving CTF challenges with a fully automated workflow. We develop two CTF-solving workflows, human-in-the-loop (HITL) and fully automated, to examine the LLMs' ability to solve a selected set of CTF challenges when prompted with information about the question. We collect human contestants' results on the same set of questions, and find that LLMs achieve a higher success rate than an average human participant. This work provides a comprehensive evaluation of the capability of LLMs in solving real-world CTF challenges, from a real competition to a fully automated workflow. Our results provide references for applying LLMs in cybersecurity education and pave the way for systematic evaluation of offensive cybersecurity capabilities in LLMs.

Paper Structure (63 sections, 7 figures, 8 tables)

Introduction
Background
Capture the Flag (CTF)
Application of CTFs
Problem Categories
CTF Platforms
LLMs and Conversational AI
Methodology
LLM-Guided CTF
Automated Framework Evaluation
Selected LLMs
Automated LLM Workflow for Solving CTFs
Evaluation with Tool Use
HITL Evaluation
Experimental Results
...and 48 more sections

Figures (7)

Figure 1: LLM-Guided CTF Workflow: 1) Contestants may draw on outside knowledge such as web search or group discussion. 2) Contestants get challenges from the database and compose their own prompts based on their understanding; then 3) they feed all the information needed to the LLM, 4) get answers from the LLM, 5) validate the answers manually, 6) finish the process if the answer is correct, and if not, 7) give feedback to the LLM or judge whether the attempt should be abandoned, and finally 8) may give up based on human judgement. This is similar to the HITL evaluation process (Figure 4) described later in this section, but during the LLM-aided CTF competition participants may use external help, such as external CTF guidelines, alongside the assistance of the LLM. The solution or solver script must still come from the LLM, with the dialogue history provided as proof.
Figure 2: Fully automated workflow for solving CTFs: 1) Set up a pre-defined prompt template; 2) Format the initial prompt based on the challenge, and apply tools in the tool chain based on LLM judgement or pre-defined behavior; 3) The CTF player environment is dockerized with all necessary toolkits installed; 4) Feed the formatted prompts to the LLM; 5) The LLM returns an answer for each prompt; 6) The LLM interacts with the player Docker container and validates the solution with the assistance of built-in validation tools; 7) The LLM accepts the output from the previous step and gives that output, alone or combined with Chain-of-Thought, as feedback; 8) The LLM judges whether the correct flag was returned or whether it should give up.
Figure 3: Example prompt for fully automated workflow.
Figure 4: HITL workflow: 1) An initial prompt template is formatted with the information provided by the challenge; 2) The formatted prompt is sent to the LLM system; 3) The LLM system returns an answer to each prompt; 4) A human judge validates the answer; 5) The process finishes if the answer is correct; 6) If the answer is not correct, the human gives feedback based on their expertise and returns it to the LLM for the next round of dialogue; 7) The human gives up or counts the attempt as a failure based on their judgement. Unlike in Figure 1, the testers are regarded as CTF experts and outside knowledge is inaccessible in this workflow.
Figure 5: GPT-4 automatically solving "baby's third", a reverse engineering challenge, using the automated workflow. Non-anonymous information is masked out.
...and 2 more figures
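
The control flow in the Figure 2 caption can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' framework: `PROMPT_TEMPLATE`, `Challenge`, `solve`, `ask_llm`, `run_in_sandbox`, the `FLAG:` / `GIVE UP` reply convention, and the round limit are all hypothetical names and choices introduced here. The LLM and the Docker container are replaced by plain callables so the sketch runs without any external service.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical prompt template (step 1); the paper's actual prompt is in Figure 3.
PROMPT_TEMPLATE = (
    "You are solving a CTF challenge.\n"
    "Name: {name}\nCategory: {category}\nDescription: {description}\n"
    "Reply with one shell command to run, 'FLAG: <flag>' to submit, or 'GIVE UP'."
)

@dataclass
class Challenge:
    name: str
    category: str
    description: str
    flag: str  # known to the validator only, never shown to the model

def solve(
    challenge: Challenge,
    ask_llm: Callable[[List[str]], str],    # assumed stand-in for an LLM API call
    run_in_sandbox: Callable[[str], str],   # assumed stand-in for the Docker container
    max_rounds: int = 10,                   # assumed cap; not taken from the paper
) -> Optional[str]:
    # Step 2: format the initial prompt from the challenge metadata.
    history = [PROMPT_TEMPLATE.format(
        name=challenge.name,
        category=challenge.category,
        description=challenge.description,
    )]
    for _ in range(max_rounds):
        reply = ask_llm(history)            # steps 4-5: send prompt, receive answer
        history.append(reply)
        submitted = re.search(r"FLAG:\s*(\S+)", reply)
        if submitted:                       # step 6: validate a submitted flag
            if submitted.group(1) == challenge.flag:
                return submitted.group(1)
            history.append("That flag is incorrect.")
            continue
        if "GIVE UP" in reply.upper():      # step 8: the model decides to stop
            return None
        # Steps 6-7: run the command and feed its output back as feedback.
        history.append("Command output:\n" + run_in_sandbox(reply.strip()))
    return None

def scripted_llm(replies: List[str]) -> Callable[[List[str]], str]:
    """Fake model that returns canned replies in order, for demonstration only."""
    remaining = iter(replies)
    return lambda history: next(remaining)

demo = Challenge("demo", "rev", "Find the flag in ./chal", "flag{demo}")
result = solve(
    demo,
    scripted_llm(["strings ./chal", "FLAG: flag{demo}"]),
    run_in_sandbox=lambda cmd: "flag{demo}",
)
print(result)
```

Passing the model and the sandbox in as callables keeps the loop testable offline; a real setup would replace them with an LLM client and a container executor, and would likely need stricter parsing of the model's replies than the single regular expression used here.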
