
CRAKEN: Cybersecurity LLM Agent with Knowledge-Based Execution

Minghao Shao, Haoran Xi, Nanda Rani, Meet Udeshi, Venkata Sai Charan Putrevu, Kimberly Milner, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique

TL;DR

CRAKEN tackles the challenge of keeping LLM cybersecurity agents up to date and capable of multi-step, knowledge-intensive tasks by introducing a knowledge-based execution framework that integrates a domain knowledge base, Self-RAG, and Graph-RAG. It deploys a planner-executor multi-agent architecture and a retrieval system that grounds reasoning in external knowledge. Compared with prior work, it improves performance on NYU CTF Bench (about 22% of challenges solved with Graph-RAG) and covers approximately 25-30% more MITRE ATT&CK techniques. The results indicate that knowledge-based execution improves vulnerability modeling and exploit planning across diverse domains at a modest increase in compute cost. An open-source dataset of CTF writeups and an extensible architecture are provided to enable embedding new security knowledge into LLM-driven agents.
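The planner-executor loop described above can be illustrated with a minimal Python sketch. It assumes, hypothetically, that the planner returns the remaining subtasks given the history of executor results, and that the executor consults a retrieval function before acting on each subtask. `PlannerExecutorLoop`, `toy_plan`, `toy_execute`, `KNOWLEDGE`, and `max_rounds` are illustrative names, not CRAKEN's API, and the toy functions stand in for LLM calls and tools.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical signatures: in a real agent each callable would wrap an LLM call or tool.
PlanFn = Callable[[str, List[str]], List[str]]   # (task, history) -> remaining subtasks
ExecFn = Callable[[str, List[str]], str]         # (subtask, knowledge) -> result summary
RetrieveFn = Callable[[str], List[str]]          # query -> knowledge snippets


@dataclass
class PlannerExecutorLoop:
    plan: PlanFn
    execute: ExecFn
    retrieve: RetrieveFn
    max_rounds: int = 10                          # hypothetical round budget
    history: List[str] = field(default_factory=list)

    def run(self, task: str) -> List[str]:
        for _ in range(self.max_rounds):
            subtasks = self.plan(task, self.history)   # planner re-plans from feedback
            if not subtasks:                           # empty plan: planner considers the task done
                break
            subtask = subtasks[0]                      # executor takes the next subtask
            knowledge = self.retrieve(subtask)         # ground the step in external knowledge
            result = self.execute(subtask, knowledge)
            self.history.append(f"{subtask} -> {result}")  # feedback for the next plan
        return self.history


# Toy stand-ins (hypothetical) so the sketch runs without any model.
STEPS = ["recon", "find vuln", "exploit"]
KNOWLEDGE: Dict[str, List[str]] = {"find vuln": ["format string writeup"]}


def toy_plan(task: str, history: List[str]) -> List[str]:
    # Remaining steps are those not yet recorded in the history.
    return STEPS[len(history):]


def toy_execute(subtask: str, knowledge: List[str]) -> str:
    return f"done ({len(knowledge)} hints)"


loop = PlannerExecutorLoop(toy_plan, toy_execute, lambda q: KNOWLEDGE.get(q, []))
print(loop.run("pwn challenge"))
```

Re-planning from the accumulated history after every step, rather than fixing the whole plan up front, lets the planner react to executor feedback; the round budget is one simple way to bound the loop.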

Abstract

Large Language Model (LLM) agents can automate cybersecurity tasks and adapt to the evolving cybersecurity landscape without re-engineering. While LLM agents have demonstrated cybersecurity capabilities in Capture-The-Flag (CTF) competitions, they have two key limitations: accessing the latest cybersecurity expertise beyond their training data, and integrating new knowledge into complex task planning. Knowledge-based approaches that incorporate technical understanding into task-solving automation can tackle these limitations. We present CRAKEN, a knowledge-based LLM agent framework that improves cybersecurity capability through three core mechanisms: contextual decomposition of task-critical information, iterative self-reflected knowledge retrieval, and knowledge-hint injection that transforms insights into adaptive attack strategies. Comprehensive evaluations with different configurations show CRAKEN's effectiveness in multi-stage vulnerability detection and exploitation compared to previous approaches. Our extensible architecture establishes new methodologies for embedding new security knowledge into LLM-driven cybersecurity agentic systems. With a knowledge database of CTF writeups, CRAKEN obtained an accuracy of 22% on NYU CTF Bench, outperforming prior work by 3% and achieving state-of-the-art results. On MITRE ATT&CK techniques, CRAKEN solves 25-30% more techniques than prior work, demonstrating improved cybersecurity capabilities via knowledge-based execution. Our framework is open source and publicly available at https://github.com/NYU-LLM-CTF/nyuctf_agents_craken.
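The three mechanisms named in the abstract can be sketched as a toy pipeline: split the task into queries (contextual decomposition), retrieve documents and grade them, rewriting the query and retrying when nothing passes (a simplified self-reflective loop in the spirit of Self-RAG), and append whatever survives to the executor prompt (knowledge-hint injection). This is a minimal sketch under stated assumptions: the keyword search, overlap-based relevance grader, sentence-splitting decomposer, and `printf` rewrite rule are hypothetical stand-ins for LLM-driven components, and the `WRITEUPS` entries are invented for illustration.

```python
from typing import Callable, List

# Toy knowledge base standing in for a database of CTF writeups (hypothetical entries).
WRITEUPS = [
    "Writeup: exploit format string bug with %n to overwrite GOT",
    "Writeup: decrypt RSA with small exponent using cube root",
]


def decompose(task: str) -> List[str]:
    """Hypothetical contextual decomposition: one query per sentence of the task."""
    return [s.strip() for s in task.split(".") if s.strip()]


def search(query: str) -> List[str]:
    """Naive keyword search: any shared word makes a document a candidate."""
    words = set(query.lower().split())
    return [d for d in WRITEUPS if words & set(d.lower().split())]


def is_relevant(query: str, doc: str) -> bool:
    """Self-reflection stand-in: keep a document only if it shares >= 2 words."""
    return len(set(query.lower().split()) & set(doc.lower().split())) >= 2


def rewrite(query: str) -> str:
    """Hypothetical query rewrite: map an API name to its vulnerability class."""
    return query.replace("printf", "format string")


def self_reflective_retrieve(
    query: str,
    search_fn: Callable[[str], List[str]],
    grade_fn: Callable[[str, str], bool],
    rewrite_fn: Callable[[str], str],
    max_iters: int = 3,
) -> List[str]:
    """Retrieve, grade each result, and rewrite the query until something passes."""
    for _ in range(max_iters):
        kept = [d for d in search_fn(query) if grade_fn(query, d)]
        if kept:
            return kept
        query = rewrite_fn(query)
    return []


def inject_hints(prompt: str, hints: List[str]) -> str:
    """Append retrieved knowledge to the executor prompt as explicit hints."""
    if not hints:
        return prompt
    return prompt + "\n\nKnowledge hints:\n" + "\n".join(f"- {h}" for h in hints)


task = "Binary uses printf on user input. Flag is in memory."
hints: List[str] = []
for q in decompose(task):
    hints += self_reflective_retrieve(q, search, is_relevant, rewrite)
prompt = inject_hints("Solve the pwn challenge.", hints)
print(prompt)
```

In this toy run, the first query retrieves the format-string writeup only after the rewrite step, while the second query exhausts its retries and contributes nothing, so the executor prompt carries a single hint.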

Paper Structure

This paper contains 13 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Architecture of CRAKEN, composed of two parts: 1. a Planner-Executor multi-agent system based on xu2023rewoo, and 2. the iterative retrieval system for RAG over the knowledge database.
  • Figure 2: Graph-RAG Retrieval
  • Figure 3: Overlap of CTFs solved by three agents on NYU CTF Bench.
  • Figure 4: Transition diagram visualizing the RAG process.
  • Figure 5: CRAKEN exit analysis by category on Claude 3.5 Sonnet, Claude 3.7 Sonnet, GPT 4o, and GPT 4.1. There are 5 types of exit cases (udeshi2025d): Max Cost, Max Round, Solved, Give up, and Error.