Table of Contents
Fetching ...

CIBER: A Comprehensive Benchmark for Security Evaluation of Code Interpreter Agents

Lei Ba, Qinbin Li, Songze Li

TL;DR

CIBER is introduced, an automated benchmark that combines dynamic attack generation, isolated secure sandboxing, and state-aware evaluation to systematically assess the vulnerability of code interpreter agents against four major types of adversarial attacks: Direct/Indirect Prompt Injection, Memory Poisoning, and Prompt-based Backdoor.

Abstract

LLM-based code interpreter agents are increasingly deployed in critical workflows, yet their robustness against risks introduced by their code execution capabilities remains underexplored. Existing benchmarks are limited to static datasets or simulated environments, failing to capture the security risks arising from dynamic code execution, tool interactions, and multi-turn context. To bridge this gap, we introduce CIBER, an automated benchmark that combines dynamic attack generation, isolated secure sandboxing, and state-aware evaluation to systematically assess the vulnerability of code interpreter agents against four major types of adversarial attacks: Direct/Indirect Prompt Injection, Memory Poisoning, and Prompt-based Backdoor. We evaluate six foundation models across two representative code interpreter agents (OpenInterpreter and OpenCodeInterpreter), incorporating a controlled study of identical models. Our results reveal that Interpreter Architecture and Model Alignment Set the Security Baseline. Structural integration enables aligned specialized models to outperform generic SOTA models. Conversely, high intelligence paradoxically increases susceptibility to complex adversarial prompts due to stronger instruction adherence. Furthermore, we identify a "Natural Language Disguise" Phenomenon, where natural language functions as a significantly more effective input modality than explicit code snippets (+14.1% ASR), thereby bypassing syntax-based defenses. Finally, we expose an alarming Security Polarization, where agents exhibit robust defenses against explicit threats yet fail catastrophically against implicit semantic hazards, highlighting a fundamental blind spot in current pattern-matching protection approaches.

CIBER: A Comprehensive Benchmark for Security Evaluation of Code Interpreter Agents

TL;DR

CIBER is introduced, an automated benchmark that combines dynamic attack generation, isolated secure sandboxing, and state-aware evaluation to systematically assess the vulnerability of code interpreter agents against four major types of adversarial attacks: Direct/Indirect Prompt Injection, Memory Poisoning, and Prompt-based Backdoor.

Abstract

LLM-based code interpreter agents are increasingly deployed in critical workflows, yet their robustness against risks introduced by their code execution capabilities remains underexplored. Existing benchmarks are limited to static datasets or simulated environments, failing to capture the security risks arising from dynamic code execution, tool interactions, and multi-turn context. To bridge this gap, we introduce CIBER, an automated benchmark that combines dynamic attack generation, isolated secure sandboxing, and state-aware evaluation to systematically assess the vulnerability of code interpreter agents against four major types of adversarial attacks: Direct/Indirect Prompt Injection, Memory Poisoning, and Prompt-based Backdoor. We evaluate six foundation models across two representative code interpreter agents (OpenInterpreter and OpenCodeInterpreter), incorporating a controlled study of identical models. Our results reveal that Interpreter Architecture and Model Alignment Set the Security Baseline. Structural integration enables aligned specialized models to outperform generic SOTA models. Conversely, high intelligence paradoxically increases susceptibility to complex adversarial prompts due to stronger instruction adherence. Furthermore, we identify a "Natural Language Disguise" Phenomenon, where natural language functions as a significantly more effective input modality than explicit code snippets (+14.1% ASR), thereby bypassing syntax-based defenses. Finally, we expose an alarming Security Polarization, where agents exhibit robust defenses against explicit threats yet fail catastrophically against implicit semantic hazards, highlighting a fundamental blind spot in current pattern-matching protection approaches.
Paper Structure (24 sections, 7 equations, 6 figures, 4 tables)

This paper contains 24 sections, 7 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of the CIBER Framework. The pipeline integrates three modules: (Left) Attack Generation creates multi-modal adversarial inputs (Natural Language, Code Descriptions, Code Snippets) across four attack methods. (Center) Isolated Execution hosts the agent in a Docker sandbox, capturing both textual responses and physical environment changes. (Right) State-Aware Evaluation applies a three-step verification logic (Intent Analysis $\mathcal{A}$ + State Verification $\mathcal{V}$) to classify security risks into the final Scoring Matrix.
  • Figure 2: Security Scenario-Domain Mapping for RedCode Dataset (25 scenarios across 8 domains)
  • Figure 3: Security Landscape under Natural Language Inputs. Comprehensive heatmap showing Attack Success Rates (left) and Reject Rates (right) across different scenarios and attack methods. Results are based on the NL baseline to isolate architectural impacts. Scenarios are sorted by average success rate from high to low.
  • Figure 4: Model Vulnerability by Input Modality (DPI Baseline). (a) ASR: Natural Language Disguise ($\text{CD} > \text{CS}$) consistently bypasses filters. (b) RR: Explicit code triggers the Pattern-Matching Trap (high refusals), with GPT-5-mini as a notable outlier.
  • Figure 5: Security Score ($SS = RR - ASR$) Layer Analysis.The evaluation reveals three distinct vulnerability layers separated by natural breakpoints at $SS \in \{0, -50\}$. While agents defend well against clear threats (Layer I), they fail constantly in cases requiring special knowledge (Layer II) or understanding intent (Layer III).
  • ...and 1 more figures