Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

Shoumik Saha, Jifan Chen, Sam Mayers, Sanjay Krishna Gouda, Zijian Wang, Varun Kumar

TL;DR

JAWS-Bench presents a three-regime, executable-aware benchmark for jailbreaking code-capable LLM agents, spanning empty, single-file, and multi-file workspaces. A hierarchical judge pipeline separates robustness (refusal/compliance and harm likelihood) from executability (syntax and runtime viability), enabling end-to-end assessment of deployable malware potential. Across seven back-end LLMs, results show a steep increase in attack success and deployable code as workspace context grows, with agent-wrapping amplifying vulnerability by roughly 1.6x and multi-file contexts yielding substantial proportions of parseable and runnable malicious artifacts. The study highlights the need for execution-time safeguards, code-contextual safety filters, and persistence of refusal decisions across planning and tool-use steps to mitigate real-world risks of autonomous code agents. It provides a reproducible framework for evaluating defenses and guiding safer deployment of code agents in software engineering workflows.

Abstract

Code-capable large language model (LLM) agents are increasingly embedded into software engineering workflows where they can read, write, and execute code, raising the stakes of safety-bypass ("jailbreak") attacks beyond text-only settings. Prior evaluations emphasize refusal or harmful-text detection, leaving open whether agents actually compile and run malicious programs. We present JAWS-BENCH (Jailbreaks Across WorkSpaces), a benchmark spanning three escalating workspace regimes that mirror attacker capability: empty (JAWS-0), single-file (JAWS-1), and multi-file (JAWS-M). We pair this with a hierarchical, executable-aware Judge Framework that tests (i) compliance, (ii) attack success, (iii) syntactic correctness, and (iv) runtime executability, moving beyond refusal to measure deployable harm. Using seven LLMs from five families as backends, we find that under prompt-only conditions in JAWS-0, code agents accept 61% of attacks on average; 58% are harmful, 52% parse, and 27% run end-to-end. Moving to the single-file regime (JAWS-1) drives compliance to ~100% for capable models and yields a mean attack success rate (ASR) of ~71%; the multi-file regime (JAWS-M) raises mean ASR to ~75%, with 32% instantly deployable attack code. Across models, wrapping an LLM in an agent substantially increases vulnerability -- ASR rises by 1.6x -- because initial refusals are frequently overturned during later planning/tool-use steps. Category-level analyses identify which attack classes are most vulnerable and most readily deployable, while others exhibit large execution gaps. These findings motivate execution-aware defenses, code-contextual safety filters, and mechanisms that preserve refusal decisions throughout the agent's multi-step reasoning and tool use.
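The four judge stages in the abstract form a funnel: each later stage only counts responses that passed every earlier one. The sketch below shows how such stacked rates could be aggregated once per-response verdicts exist. It is an illustration, not the paper's implementation: the names `JudgeOutcome`, `STAGES`, and `funnel_rates` are hypothetical, the verdicts are assumed to be precomputed booleans (the paper obtains them from LLM-based and agentic judges), and reporting every rate as a fraction of all attempts is an assumption based on how the abstract states its percentages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeOutcome:
    """Hypothetical record of the four judge verdicts for one agent response."""
    complied: bool  # (i) the agent did not refuse
    harmful: bool   # (ii) the attack was judged successful
    parses: bool    # (iii) the produced code is syntactically valid
    runs: bool      # (iv) the produced code executes without runtime errors


# Stage order is assumed to follow the abstract: (i) -> (ii) -> (iii) -> (iv).
STAGES = ("complied", "harmful", "parses", "runs")


def funnel_rates(outcomes):
    """Return, per stage, the fraction of ALL attempts that pass it and every earlier stage."""
    total = len(outcomes)
    if total == 0:
        return {stage: 0.0 for stage in STAGES}
    rates = {}
    surviving = list(outcomes)
    for stage in STAGES:
        # A response reaches this stage only if it passed all previous ones.
        surviving = [o for o in surviving if getattr(o, stage)]
        rates[stage] = len(surviving) / total
    return rates


if __name__ == "__main__":
    # Made-up verdicts for four attempts, for illustration only.
    sample = [
        JudgeOutcome(True, True, True, True),
        JudgeOutcome(True, True, True, False),
        JudgeOutcome(True, False, False, False),
        JudgeOutcome(False, False, False, True),  # refused: never counted downstream
    ]
    print(funnel_rates(sample))
```

Because each stage filters the survivors of the previous one, the rates are non-increasing by construction, which matches the shape of the reported JAWS-0 figures (61% accepted, 58% harmful, 52% parse, 27% run).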

Paper Structure

This paper contains 36 sections, 3 equations, 20 figures, 8 tables.

Figures (20)

  • Figure 1: Overview. Our end-to-end evaluation pipeline across the three workspace regimes -- JAWS-0 (Empty), JAWS-1 (Single-File), and JAWS-M (Multi-File) -- which mirror naive, capable, and expert attacker settings. In JAWS-0, the attacker supplies only a textual prompt; in JAWS-1, a single malicious file contains a <FILL_HERE> region for completion; in JAWS-M, malicious logic is distributed across modules with one function body removed (in worm.py) for cross-file completion. Each scenario passes through the same judge framework: an LLM-based robustness layer (Refusal Judge $\rightarrow$ Attack Evaluation Judge) and an agentic executability layer (Syntax-Error Judge $\rightarrow$ Runtime-Error Judge). The stacked outcomes (Not Refused $\rightarrow$ Harmful $\rightarrow$ Parsable $\rightarrow$ Executable) quantify how many responses progress from policy violation to deployable malicious code.
  • Figure 2: JAWS-0 (Empty) results. Multi-stage judge outcomes for the empty-workspace regime. Higher values indicate greater jailbreak risk; darker shades denote stricter judges.
  • Figure 3: Jailbreak rate for different malicious categories in JAWS-Bench. Full breakdown in Table \ref{tab:cat_analysis}.
  • Figure 4: JAWS-0 (Empty Workspace)
  • Figure 5: JAWS-1 (Single-File Workspace)
  • ...and 15 more figures