DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments

Chiyu Zhang; Marc-Alexandre Cote; Michael Albada; Anush Sankaran; Jack W. Stokes; Tong Wang; Amir Abdi; William Blum; Muhammad Abdul-Mageed

TL;DR

DefenderBench addresses the lack of robust, cross-model evaluation for LLM-based cybersecurity agents with an open-source, modular toolkit that spans offense, defense, and cybersecurity knowledge tasks. It converts five cybersecurity tasks into interactive environments: network intrusion, malicious content detection, cyber threat intelligence multiple-choice question answering (CTI MCQA), code vulnerability detection, and code vulnerability fixing. A unified DefenderBench score enables fair comparison across models. Claude-3.7-sonnet is the top performer; augmentations and chain-of-thought (CoT) prompting yield notable gains on interactive tasks, while open-weight models lag on complex vulnerability tasks. The toolkit offers a practical, reproducible platform for future research on LLM-enabled cybersecurity and its deployment considerations, with clear avenues for expanding task and model coverage.

Abstract

Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40, while the best open-weight model, Llama 3.3 70B, is not far behind with a DefenderBench score of 71.81. DefenderBench's modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons. An anonymized version of DefenderBench is available at https://github.com/microsoft/DefenderBench.
