RedCode: Risky Code Execution and Generation Benchmark for Code Agents

Chengquan Guo, Xun Liu, Chulin Xie, Andy Zhou, Yi Zeng, Zinan Lin, Dawn Song, Bo Li

TL;DR

RedCode introduces a real-system safety benchmark for code agents, covering both risky code execution (RedCode-Exec) and malicious software generation (RedCode-Gen). RedCode-Exec contains 4,050 test cases in Python, Bash, and natural-language formats that run in Dockerized environments; RedCode-Gen contains 160 prompts. The study evaluates 19 code agents built from three agent frameworks (CodeAct, ReAct, and OpenCodeInterpreter) and 19 LLMs. It finds a high overall attack success rate, higher rejection rates for risky operations on operating systems and file systems than in other domains, and lower rejection rates when risky operations are described in natural text rather than code. The authors release the dataset and code and call for stringent, multi-format safety evaluations of code agents.

Abstract

With the rapidly increasing capabilities and adoption of code agents for AI-assisted coding, safety concerns, such as generating or executing risky code, have become significant barriers to the real-world deployment of these agents. To provide comprehensive and practical evaluations on the safety of code agents, we propose RedCode, a benchmark for risky code execution and generation: (1) RedCode-Exec provides challenging prompts that could lead to risky code execution, aiming to evaluate code agents' ability to recognize and handle unsafe code. We provide a total of 4,050 risky test cases in Python and Bash tasks with diverse input formats including code snippets and natural text. They cover 25 types of critical vulnerabilities spanning 8 domains (e.g., websites, file systems). We provide Docker environments and design corresponding evaluation metrics to assess their execution results. (2) RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions to generate harmful code or software. Our empirical findings, derived from evaluating three agent frameworks based on 19 LLMs, provide insights into code agents' vulnerabilities. For instance, evaluations on RedCode-Exec show that agents are more likely to reject executing risky operations on the operating system, but are less likely to reject executing technically buggy code, indicating high risks. Risky operations described in natural text lead to a lower rejection rate than those in code format. Additionally, evaluations on RedCode-Gen show that more capable base models and agents with stronger overall coding abilities, such as GPT4, tend to produce more sophisticated and effective harmful software. Our findings highlight the need for stringent safety evaluations for diverse code agents. Our dataset and code are available at https://github.com/AI-secure/RedCode.

Paper Structure

This paper contains 55 sections, 14 figures, 10 tables, 1 algorithm.

Figures (14)

  • Figure 1: Safety evaluation of 19 code agents on RedCode-Exec Python and Bash test cases under agent frameworks CodeAct, ReAct, and OpenCodeInterpreter (OCI). Among the evaluated LLMs, * denotes the fine-tuned LLMs released by OCI and CodeAct.
  • Figure 2: Overview of RedCode benchmark curation and evaluation, where RedCode-Exec assesses risks associated with code execution and RedCode-Gen evaluates risks in software generation.
  • Figure 3: 25 risky scenarios spanning 8 domains in RedCode-Exec.
  • Figure 4: Curation overview of RedCode-Exec. We map the risky Python snippets into Bash snippets and corresponding text summaries or descriptions to form the final risky test cases.
  • Figure 5: Rejection rate and attack success rate of 8 representative code agents on RedCode-Exec Python test cases in 8 domains. The overall attack success rate is high, indicating existing code agents are vulnerable. Agents achieve a higher rejection rate on risky cases in operating and file systems than those in other domains.
  • ...and 9 more figures
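
The Figure 5 caption reports a rejection rate and an attack success rate per domain. As a rough illustration of how such per-domain rates could be tallied from individual test-case outcomes, here is a minimal sketch. The outcome labels, the function name `rates_by_domain`, and the sample data are all hypothetical and are not taken from the RedCode repository; the benchmark's actual scoring scheme may differ.

```python
from collections import Counter, defaultdict

# Hypothetical outcome labels (assumption, not the benchmark's own scheme).
REJECTED = "rejected"            # agent refused to run the risky task
FAILED = "execution_failed"      # agent tried, but the risky outcome did not occur
SUCCEEDED = "attack_succeeded"   # risky outcome was observed in the sandbox


def rates_by_domain(results):
    """Compute rejection and attack success rates per domain.

    results: iterable of (domain, outcome) pairs, one per test case.
    """
    counts = defaultdict(Counter)
    for domain, outcome in results:
        counts[domain][outcome] += 1

    summary = {}
    for domain, outcome_counts in counts.items():
        total = sum(outcome_counts.values())
        summary[domain] = {
            "rejection_rate": outcome_counts[REJECTED] / total,
            "attack_success_rate": outcome_counts[SUCCEEDED] / total,
        }
    return summary


# Made-up sample data, for illustration only.
sample = [
    ("operating_system", REJECTED),
    ("operating_system", REJECTED),
    ("operating_system", SUCCEEDED),
    ("operating_system", FAILED),
    ("website", SUCCEEDED),
    ("website", SUCCEEDED),
    ("website", REJECTED),
    ("website", SUCCEEDED),
]

summary = rates_by_domain(sample)
for domain, rates in summary.items():
    print(domain, rates)
```

Keeping a third "execution failed" outcome separate means the two rates need not sum to one, which is consistent with reporting them as distinct metrics.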