EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities

Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E. Jimenez, Farshad Khorrami, Prashanth Krishnamurthy, Brendan Dolan-Gavitt, Muhammad Shafique, Karthik Narasimhan, Ramesh Karri, Ofir Press

TL;DR

EnIGMA addresses the gap in LM agents solving cybersecurity tasks by introducing Interactive Agent Tools (IATs) and Summarizers, enabling non-blocking interaction with interactive tools (e.g., debuggers, remote servers) and concise handling of long outputs. It embeds these components in a SWE-agent-based framework and evaluates on 390 CTF challenges across NYU CTF, InterCode-CTF, CyBench, and HTB, achieving state-of-the-art results on NYU CTF and CyBench and strong gains on InterCode-CTF. The work also analyzes data leakage and the soliloquizing phenomenon, investigates extrapolation to unseen challenges, and releases open-source code and datasets to promote reproducibility and further progress. Together, these contributions advance autonomous LM-powered cybersecurity problem solving and provide a foundation for broader, safer deployment in security-critical domains.

Abstract

Although language model (LM) agents have demonstrated increased performance in multiple domains, including coding and web-browsing, their success in cybersecurity has been limited. We present EnIGMA, an LM agent for autonomously solving Capture The Flag (CTF) challenges. We introduce new tools and interfaces to improve the agent's ability to find and exploit security vulnerabilities, focusing on interactive terminal programs. These novel Interactive Agent Tools enable LM agents, for the first time, to run interactive utilities, such as a debugger and a server connection tool, which are essential for solving these challenges. Empirical analysis on 390 CTF challenges across four benchmarks demonstrates that these new tools and interfaces substantially improve our agent's performance, achieving state-of-the-art results on NYU CTF, InterCode-CTF, and CyBench. Finally, we analyze data leakage, developing new methods to quantify it and identifying a new phenomenon we term soliloquizing, where the model self-generates hallucinated observations without interacting with the environment. Our code and development dataset are available at https://github.com/SWE-agent/SWE-agent/tree/v0.7 and https://github.com/NYU-LLM-CTF/NYU_CTF_Bench/tree/main/development, respectively.
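
The idea of keeping an interactive program alive while the agent continues to work elsewhere can be illustrated with a small sketch. This is not EnIGMA's implementation: the `InteractiveSession` class, its `send` and `close` methods, and the timeout values are hypothetical names chosen for illustration, and a tiny Python child process stands in for a debugger or a server connection.

```python
import queue
import subprocess
import sys
import threading


class InteractiveSession:
    """Hypothetical wrapper that keeps an interactive program running in the background."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            bufsize=1,
        )
        self._lines = queue.Queue()
        # A background thread drains stdout so the caller never blocks on a read.
        threading.Thread(target=self._pump, daemon=True).start()

    def _pump(self):
        for line in self.proc.stdout:
            self._lines.put(line.rstrip("\n"))

    def send(self, command, first_timeout=10.0, idle_timeout=0.3):
        """Send one command, then return the lines printed in response.

        Waits up to first_timeout for the first line, then stops once the
        program has been quiet for idle_timeout seconds (assumed values).
        """
        self.proc.stdin.write(command + "\n")
        self.proc.stdin.flush()
        output = []
        timeout = first_timeout
        while True:
            try:
                output.append(self._lines.get(timeout=timeout))
            except queue.Empty:
                return output
            timeout = idle_timeout

    def close(self):
        self.proc.stdin.close()
        self.proc.wait(timeout=5)


# Stand-in for a debugger or server client: a child that answers each input line.
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print('reply: ' + line.strip(), flush=True)\n"
)

session = InteractiveSession([sys.executable, "-c", CHILD])
first = session.send("login guest")
# The caller could run ordinary shell commands here; the child stays alive.
second = session.send("login admin")
session.close()
print(first, second)
```

The design choice worth noting is that each `send` returns control once the program goes quiet, so the session persists between calls instead of blocking the caller until the program exits.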

Paper Structure (29 sections, 17 figures, 15 tables)

Figures (17)

  • Figure 1: EnIGMA is an LM agent fed with CTF challenges from the NYU CTF benchmark. It interacts with the computer through an environment that is built on top of SWE-agent (Yang et al., 2024) and extends it to cybersecurity. We incorporate new interactive tools that assist the agent in debugging and connecting to remote servers. The agent iterates through interactions and feedback from the environment until it solves the challenge.
  • Figure 2: Partial trajectory of EnIGMA (powered by GPT-4 Turbo) solving a reverse engineering challenge from the development set, where it uses the interactive interface to interact with the challenge server. After the first attempt to log in to the server fails, the agent returns to the main shell (bash) to find more clues about the password, while the connection to the challenge server remains open in the background. This is similar to how humans use computer systems.
  • Figure 3: Partial EnIGMA trajectories for a reverse engineering challenge to compare the summarizers. (a) The LM summarizer provides a detailed summary explaining the main function implementation along with a viable approach to solve the challenge. (b) The simple summarizer shows a window of the output saved in a file. (c) With no summarizer, the output is sent back to the LM and may fill up its entire context window, thereby immediately ending the session.
  • Figure 4: EnIGMA (powered by Claude 3.5 Sonnet) success and failure counts, stacked, by number of turns.
  • Figure 5: Analysis of debug action sequences performed by EnIGMA with Claude 3.5 Sonnet on reverse engineering tasks. Arrows point to an action called immediately after a previous action, with percentages quantifying the probabilities of these transitions (similar to a Markov chain). Numbers suffixed with $\times$ indicate the number of occurrences of the action or transition in the sample. For example, the agent used breakpoint 32 times in the sample, and in 75% of these calls (24 times), continue was the next action. Because debug actions can be followed by non-debug actions, only a subset of transitions is shown.
  • ...and 12 more figures
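
The transition percentages described in the Figure 5 caption can be reproduced from logged action sequences with a short computation. The sketch below is an assumption about how such statistics might be derived, not the authors' analysis code: the function name `transition_stats`, the action names, and the toy trajectories are all hypothetical. Following the caption's example, each probability is the number of times one action immediately follows another, divided by the total number of occurrences of the first action.

```python
from collections import Counter, defaultdict


def transition_stats(trajectories, debug_actions):
    """Count debug actions and estimate next-action probabilities.

    trajectories: list of action-name lists, one list per agent run.
    debug_actions: set of action names treated as debug actions.
    """
    action_counts = Counter()
    pair_counts = defaultdict(Counter)
    for actions in trajectories:
        for i, action in enumerate(actions):
            if action not in debug_actions:
                continue
            action_counts[action] += 1
            # Record the immediate successor, which may be a non-debug action.
            if i + 1 < len(actions):
                pair_counts[action][actions[i + 1]] += 1
    probs = {
        action: {nxt: n / action_counts[action] for nxt, n in successors.items()}
        for action, successors in pair_counts.items()
    }
    return action_counts, probs


# Toy data with made-up action names, for illustration only.
trajectories = [
    ["debug_start", "breakpoint", "continue", "step", "submit"],
    ["debug_start", "breakpoint", "continue", "breakpoint", "step"],
]
debug_actions = {"debug_start", "breakpoint", "continue", "step"}

counts, probs = transition_stats(trajectories, debug_actions)
print(counts["breakpoint"], probs["breakpoint"])
```

Because an action at the end of a trajectory has no successor, the probabilities for a given action need not sum to one.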