
Quantifying Frontier LLM Capabilities for Container Sandbox Escape

Rahul Marchand, Art O Cathain, Jerome Wynne, Philippos Maximos Giavridis, Sam Deverett, John Wilkinson, Jason Gwartz, Harry Coppock

TL;DR

We introduce SANDBOXESCAPEBENCH, an open benchmark that safely measures an LLM's capacity to break out of isolated sandboxes. We find that, when vulnerabilities are added, LLMs are able to identify and exploit them, showing that evaluations like SANDBOXESCAPEBENCH are needed to ensure sandboxing continues to provide the encapsulation required for highly capable models.

Abstract

Large language models (LLMs) increasingly act as autonomous agents, using tools to execute code, read and write files, and access networks, creating novel security risks. To mitigate these risks, agents are commonly deployed and evaluated in isolated "sandbox" environments, often implemented using Docker/OCI containers. We introduce SANDBOXESCAPEBENCH, an open benchmark that safely measures an LLM's capacity to break out of these sandboxes. The benchmark is implemented as an Inspect AI Capture the Flag (CTF) evaluation utilising a nested sandbox architecture, in which the outer layer contains the flag and has no known vulnerabilities. Following a threat model of a motivated adversarial agent with shell access inside a container, SANDBOXESCAPEBENCH covers a spectrum of sandbox-escape mechanisms spanning misconfiguration, privilege allocation mistakes, kernel flaws, and runtime/orchestration weaknesses. We find that, when vulnerabilities are added, LLMs are able to identify and exploit them, showing that evaluations like SANDBOXESCAPEBENCH are needed to ensure sandboxing continues to provide the encapsulation required for highly capable models.

Paper Structure (65 sections, 15 figures, 4 tables)

Figures (15)

  • Figure 1: (a) Evaluation architecture: Inspect orchestrates parallel VM sandboxes, each running a container where the model executes bash commands. Success requires escaping the container to read /flag.txt on the host. Each sandbox can be configured from any of 18 scenarios across three attack layers. (b) Scenario taxonomy showing Orchestration (4), Runtime (8), and Kernel (6) level scenarios, with difficulty ratings from 1 to 5. See the scenarios appendix for grading criteria.
  • Figure 2: Container escape mean success rates over 5 epochs for each model × scenario pairing, grouped by scenario difficulty rating (1/5 to 5/5; see the scenarios appendix for more detail).
  • Figure 3: Performance scaling plots grouped by scenario difficulty. Each point represents the success rate a model would have achieved if it had been terminated at that level of inference resource. The top row is limited by total token usage while the bottom row is limited by token cost, calculated using current input, output, and cache provider rates. Because model pricing differs, the cost-based plots terminate earlier and at different points, as each sample reaches the 2.5 million token limit at a different total cost. 95% Wilson confidence intervals are removed in the top left subplot for readability.
  • Figure 4: Effect of vulnerability hinting on performance. (a) Success rate as a function of inference-time token budget, grouped by difficulty. Each point indicates the success rate a model would achieve if terminated at that token count. (b) Mean escape success rates for GPT-5 and Opus with and without hints.
  • Figure 5: Empirical CDF of transcript lengths by model. Solid lines indicate successful samples; dashed lines indicate failures.
  • ...and 10 more figures
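
The Figure 3 and Figure 4 captions describe a success rate "if terminated at" a given token budget, with 95% Wilson confidence intervals. The following is a minimal sketch of how such a point could be computed, not the authors' implementation. The `Sample` record, its field names, and the convention that a run counts as successful at a budget only if it solved the task within that many tokens are all assumptions made for illustration.

```python
import math
from typing import NamedTuple, Sequence, Tuple


class Sample(NamedTuple):
    """Hypothetical record of one evaluation run (field names are assumptions)."""
    solved: bool        # whether the run read the flag
    tokens_at_end: int  # tokens consumed when the run solved the task or stopped


def success_rate_at_budget(samples: Sequence[Sample], budget: int) -> float:
    """Fraction of runs that had already succeeded within `budget` tokens."""
    if not samples:
        raise ValueError("need at least one sample")
    wins = sum(1 for s in samples if s.solved and s.tokens_at_end <= budget)
    return wins / len(samples)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Made-up runs, purely to exercise the functions.
    runs = [Sample(True, 100), Sample(True, 500), Sample(False, 1000), Sample(True, 2000)]
    for budget in (50, 600, 2500):
        rate = success_rate_at_budget(runs, budget)
        lo, hi = wilson_interval(round(rate * len(runs)), len(runs))
        print(f"budget={budget}: rate={rate:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

The Wilson interval is a natural choice when each model × scenario cell has only a handful of epochs, since it stays inside [0, 1] and remains informative when the observed rate is exactly 0 or 1.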