NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security

Minghao Shao, Sofija Jancheska, Meet Udeshi, Brendan Dolan-Gavitt, Haoran Xi, Kimberly Milner, Boyuan Chen, Max Yin, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique

TL;DR

To address the lack of scalable benchmarks for evaluating LLMs on cybersecurity Capture-the-Flag tasks, the authors introduce NYU CTF Bench, an open, CSAW-derived dataset paired with an automated evaluation framework. The dataset comprises 200 challenges across cryptography, forensics, pwn, rev, web, and misc, standardized with Docker-based deployment and JSON metadata. The framework integrates multiple backends and external cybersecurity tools via function calling to autonomously solve challenges. Experiments indicate GPT-4 achieves the strongest performance among tested models but reveal notable limitations in open-source models and emphasize ethical considerations in offensive-security applications.

Abstract

Large Language Models (LLMs) are being deployed across various domains today. However, their capacity to solve Capture the Flag (CTF) challenges in cybersecurity has not been thoroughly evaluated. To address this, we develop a novel method to assess LLMs in solving CTF challenges by creating a scalable, open-source benchmark database specifically designed for these applications. This database includes metadata for LLM testing and adaptive learning, compiling a diverse range of CTF challenges from popular competitions. Utilizing the advanced function calling capabilities of LLMs, we build a fully automated system with an enhanced workflow and support for external tool calls. Our benchmark dataset and automated framework allow us to evaluate the performance of five LLMs, encompassing both black-box and open-source models. This work lays the foundation for future research into improving the efficiency of LLMs in interactive cybersecurity tasks and automated task planning. By providing a specialized benchmark, our project offers an ideal platform for developing, testing, and refining LLM-based approaches to vulnerability detection and resolution. Evaluating LLMs on these challenges and comparing them with human performance yields insights into the potential of AI-driven cybersecurity solutions for real-world threat management. We make our benchmark dataset open source to the public at https://github.com/NYU-LLM-CTF/NYU_CTF_Bench, along with our playground automated framework at https://github.com/NYU-LLM-CTF/llm_ctf_automation.
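
To make the idea of tool use via function calling concrete, the following is a minimal Python sketch of how a model-issued tool call might be routed to a local tool and turned into a result. It is not the authors' framework: the metadata fields, the tool names (`check_flag`, `decode_hex`), and the JSON call format are all assumptions chosen for illustration.

```python
import json
from typing import Callable, Dict

# Hypothetical challenge metadata; these field names are illustrative
# and are NOT taken from the NYU CTF Bench schema.
CHALLENGE = {
    "name": "example-challenge",
    "category": "crypto",
    "flag": "flag{example}",
}


def check_flag(flag: str) -> dict:
    """Compare a submitted flag with the expected one."""
    return {"correct": flag == CHALLENGE["flag"]}


def decode_hex(data: str) -> dict:
    """Stand-in for an external tool: decode a hex string to text."""
    return {"text": bytes.fromhex(data).decode("utf-8", errors="replace")}


# Registry mapping tool names (as the model would emit them) to functions.
TOOLS: Dict[str, Callable[..., dict]] = {
    "check_flag": check_flag,
    "decode_hex": decode_hex,
}


def dispatch(tool_call_json: str) -> dict:
    """Run one model-issued tool call and return a JSON-serializable result.

    Malformed JSON, unknown tools, and bad arguments are reported back as
    an error dict instead of raising, so a solver loop can keep going.
    """
    try:
        call = json.loads(tool_call_json)
        func = TOOLS[call["name"]]
        return func(**call.get("arguments", {}))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        return {"error": f"{type(exc).__name__}: {exc}"}
```

In a solver loop, each result returned by `dispatch` would be fed back to the model as the next observation, and the loop would stop once `check_flag` reports a correct flag or a round limit is reached.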

Paper Structure

This paper contains 21 sections, 8 figures, and 13 tables.

Figures (8)

  • Figure 1: Distribution of Challenge Difficulties in Qualifying and Final Rounds.
  • Figure 2: NYU CTF Data Structure.
  • Figure 3: Architecture of the automated CTF solution framework.
  • Figure 4: Example of Default Prompt Format Used in the Framework.
  • Figure 5: LLM Solver Excerpts for the "Puffin" Pwn Challenge in Table \ref{tab:challenges}.
  • ...and 3 more figures