PACEbench: A Framework for Evaluating Practical AI Cyber-Exploitation Capabilities

Zicheng Liu, Lige Huang, Jie Zhang, Dongrui Liu, Yuan Tian, Jing Shao

TL;DR

PACEbench is a realistic, CVE-driven benchmark for evaluating autonomous AI cyber-exploitation capabilities. It addresses the limitations of existing CTF benchmarks by integrating vulnerability difficulty, environmental complexity, and cyber defenses. The benchmark is paired with PACEagent, a modular, three-phase agent (reconnaissance, analysis, exploitation) that uses the Model Context Protocol (MCP) to orchestrate tools and maintains memory for long-horizon tasks. In experiments across seven frontier LLMs, the models show limited ability to carry out complex multi-host attacks autonomously, and none bypasses the defenses, which suggests that current models do not yet pose a generalized cyber-offense threat. The work provides a structured methodology and quantifiable metrics (Pass@5 and BenchScore) to guide safe advancement and robust evaluation of AI-assisted cyber operations.
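As a rough illustration of the Pass@5 metric named above, the sketch below assumes that a task counts as solved when at least one of its first five attempts succeeds, and that the score is the fraction of tasks solved. This excerpt does not give the paper's exact definition, so the function name `pass_at_k`, the input layout, and the sample task names are all hypothetical. BenchScore is not sketched because its weighting is not described here.

```python
from typing import Dict, List


def pass_at_k(attempts: Dict[str, List[bool]], k: int = 5) -> float:
    """Fraction of tasks with at least one success in the first k attempts.

    Assumption: this is the common "any of k tries succeeds" reading of
    Pass@k, not necessarily the exact formula used by PACEbench.
    """
    if not attempts:
        return 0.0
    solved = sum(1 for outcomes in attempts.values() if any(outcomes[:k]))
    return solved / len(attempts)


# Hypothetical outcomes: task name -> success flag for each of five runs.
example_results = {
    "task_a": [False, True, False, False, False],
    "task_b": [False, False, False, False, False],
    "task_c": [True, True, True, True, True],
}

score = pass_at_k(example_results, k=5)
print(f"Pass@5 = {score:.2f}")
```

Under this reading, a task solved once in five runs counts the same as one solved every time, so the metric measures reachability of a solution rather than reliability.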

Abstract

The increasing autonomy of Large Language Models (LLMs) necessitates a rigorous evaluation of their potential to aid in cyber offense. Existing benchmarks often lack real-world complexity and are thus unable to accurately assess LLMs' cybersecurity capabilities. To address this gap, we introduce PACEbench, a practical AI cyber-exploitation benchmark built on the principles of realistic vulnerability difficulty, environmental complexity, and cyber defenses. Specifically, PACEbench comprises four scenarios spanning single, blended, chained, and defense vulnerability exploitations. To handle these complex challenges, we propose PACEagent, a novel agent that emulates human penetration testers by supporting multi-phase reconnaissance, analysis, and exploitation. Extensive experiments with seven frontier LLMs demonstrate that current models struggle with complex cyber scenarios, and none can bypass defenses. These findings suggest that current models do not yet pose a generalized cyber offense threat. Nonetheless, our work provides a robust benchmark to guide the trustworthy development of future models.

Paper Structure

This paper contains 38 sections, 1 equation, 7 figures, 5 tables.

Figures (7)

  • Figure 1: An overview of PACEbench. In this benchmark, an agent's score is a function of both task-specific difficulty and the complexity of the scenario, which scales from isolated vulnerabilities to complex environments.
  • Figure 2: Comparison of cybersecurity benchmarks. Based on the principles of vulnerability difficulty, environment complexity, and cyber defenses, our benchmark (center) incorporates complex elements like a WAF and multiple hosts, offering a more realistic simulation of real-world environments (left) than traditional CTFs (right).
  • Figure 3: The architecture of the PACEagent framework. The red line illustrates the conventional external ReAct loop. The components shown in black are our novel enhancements for cybersecurity operations, featuring a phase manager to control the agent core's state, a tools router for tool orchestration, and a memory module to improve efficiency and prevent repetition.
  • Figure 4: Count of models with successful exploits across CVE difficulty levels, as measured by human pass rate.
  • Figure 5: Performance comparison between our PACEagent and CAI.
  • ...and 2 more figures
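
The three components named in the Figure 3 caption (phase manager, tools router, memory module) can be pictured with a minimal sketch. Everything below is hypothetical: the class names, method names, and the stub tool are illustrative assumptions, not PACEagent's actual interfaces, and no real tool or MCP call is made.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Set, Tuple


class Phase(Enum):
    RECONNAISSANCE = 1
    ANALYSIS = 2
    EXPLOITATION = 3


class PhaseManager:
    """Hypothetical phase manager: moves through the three phases in order."""

    _order = [Phase.RECONNAISSANCE, Phase.ANALYSIS, Phase.EXPLOITATION]

    def __init__(self) -> None:
        self.phase = Phase.RECONNAISSANCE

    def advance(self) -> Phase:
        idx = self._order.index(self.phase)
        if idx < len(self._order) - 1:
            self.phase = self._order[idx + 1]
        return self.phase  # stays at EXPLOITATION once reached


class ToolRouter:
    """Hypothetical tools router: dispatches a call to a registered tool by name."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, tool: Callable[[str], str]) -> None:
        self._tools[name] = tool

    def call(self, name: str, arg: str) -> str:
        if name not in self._tools:
            return f"error: unknown tool {name!r}"
        return self._tools[name](arg)


@dataclass
class Memory:
    """Hypothetical memory module: logs observations and flags repeated actions."""

    seen: Set[Tuple[str, str]] = field(default_factory=set)
    log: List[str] = field(default_factory=list)

    def is_repeat(self, name: str, arg: str) -> bool:
        return (name, arg) in self.seen

    def record(self, name: str, arg: str, observation: str) -> None:
        self.seen.add((name, arg))
        self.log.append(f"{name}({arg}) -> {observation}")


def step(router: ToolRouter, memory: Memory, name: str, arg: str) -> str:
    """One agent action: skip exact repeats, otherwise route and remember."""
    if memory.is_repeat(name, arg):
        return "skipped: already tried"
    observation = router.call(name, arg)
    memory.record(name, arg, observation)
    return observation


# Stub tool standing in for a real scanner; it only echoes its input.
router = ToolRouter()
router.register("stub_scan", lambda target: f"scanned {target}")
memory = Memory()
manager = PhaseManager()

first = step(router, memory, "stub_scan", "host-1")
second = step(router, memory, "stub_scan", "host-1")  # same action again
manager.advance()
print(manager.phase.name, first, second, sep=" | ")
```

In this sketch the memory check runs before routing, so an identical tool call is never executed twice, which is one simple way to get the "prevent repetition" behaviour the caption mentions.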