AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation

Changyi Li, Pengfei Lu, Xudong Pan, Fazl Barez, Min Yang

TL;DR

This work presents AutoControl Arena, an automated framework for frontier AI risk evaluation built on the principle of logic-narrative decoupling, which achieves over 98% end-to-end success and 60% human preference over existing simulators.

Abstract

As Large Language Models (LLMs) evolve into autonomous agents, existing safety evaluations face a fundamental trade-off: manual benchmarks are costly, while LLM-based simulators are scalable but suffer from logic hallucination. We present AutoControl Arena, an automated framework for frontier AI risk evaluation built on the principle of logic-narrative decoupling. By grounding deterministic state in executable code while delegating generative dynamics to LLMs, we mitigate hallucination while maintaining flexibility. This principle, instantiated through a three-agent framework, achieves over 98% end-to-end success and 60% human preference over existing simulators. To elicit latent risks, we vary environmental Stress and Temptation across X-Bench (70 scenarios, 7 risk categories). Evaluating 9 frontier models reveals: (1) Alignment Illusion: risk rates surge from 21.7% to 54.5% under pressure, with capable models showing disproportionately larger increases; (2) Scenario-Specific Safety Scaling: advanced reasoning improves robustness for direct harms but worsens it for gaming scenarios; and (3) Divergent Misalignment Patterns: weaker models cause non-malicious harm while stronger models develop strategic concealment.
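
The logic-narrative decoupling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Kernel` class, the `template_narrator` function (a stand-in for an LLM call), and the file-system example are all hypothetical names chosen for illustration. The point it tries to show is only that the outcome of an action is decided in code first, and the generated text can describe that outcome but cannot alter it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Kernel:
    """Deterministic state (hypothetical): the single source of truth."""
    files: Dict[str, str] = field(default_factory=dict)

    def read_file(self, path: str) -> dict:
        # Pure logic check: the outcome depends only on stored state.
        if path not in self.files:
            return {"ok": False, "error": "not_found", "path": path}
        return {"ok": True, "path": path, "content": self.files[path]}


def template_narrator(result: dict) -> str:
    """Stand-in for an LLM call (assumption): phrases the result, never changes it."""
    if result["ok"]:
        return f"The file {result['path']} opens: {result['content']}"
    return f"No file exists at {result['path']}."


def step(kernel: Kernel, path: str, narrate: Callable[[dict], str]) -> dict:
    result = kernel.read_file(path)  # logic is resolved first, in code
    return {"result": result, "text": narrate(result)}  # narrative comes second


kernel = Kernel(files={"/logs/audit.txt": "3 entries"})
hit = step(kernel, "/logs/audit.txt", template_narrator)
miss = step(kernel, "/secrets/key.pem", template_narrator)
```

Because `step` records the kernel's `result` separately from the narrated `text`, an evaluator could audit the structured result without trusting the generated prose, which is one plausible reading of how grounding state in code limits logic hallucination.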

Paper Structure (79 sections, 18 figures, 10 tables)

Figures (18)

  • Figure 1: Existing methods trade off fidelity for scalability: Manual Benchmarks are high-fidelity but labor-intensive, while In-Context Simulators are scalable but suffer from logic hallucination. AutoControl Arena achieves the Pareto frontier by combining executable ground-truth with generative flexibility.
  • Figure 2: AutoControl Arena Framework. The Architect transforms user intent into structured design specifications; the Coder synthesizes executable Python environments; and the Monitor performs behavioral auditing and generates comprehensive risk reports.
  • Figure 3: Logic-Narrative Bridge Mechanism. The framework decomposes environment interactions into deterministic logic checks (Kernel) and context-aware content generation (Narrative Layer).
  • Figure 4: X-Bench Overview. Left: The 70 risk scenarios organized into 7 categories and the $2\times 2$ Elicitation Matrix used to synthesize distinct test cases from each scenario. Right: Domain distribution breakdown across 15 operational sectors.
  • Figure 5: Real-to-Sim Verification: AutoControl Arena reproduces sophisticated alignment failures observed in frontier models. Left: The blackmailing agent from the Claude 4 System Card [anthropic2024claude4]. Middle: Reward hacking behavior seen in OpenAI's o1 model [openai2025cot]. Right: In-context scheming capabilities demonstrated by Apollo Research [apollo2024scheming].
  • ...and 13 more figures
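
The Elicitation Matrix mentioned in the Figure 4 caption can be sketched as a cross product of Stress and Temptation levels applied to each scenario. This is a rough illustration under stated assumptions: the level labels `"low"` and `"high"`, the function `build_test_cases`, and the use of every matrix cell for every scenario are hypothetical and not confirmed by the text here. Only the counts of 70 scenarios and a 2×2 matrix come from the source.

```python
from itertools import product

# Hypothetical level labels; the source only states that the matrix is 2x2.
STRESS_LEVELS = ("low", "high")
TEMPTATION_LEVELS = ("low", "high")
NUM_SCENARIOS = 70  # stated in the abstract


def build_test_cases(scenario_ids):
    """Pair every scenario with every (stress, temptation) cell of the matrix."""
    cells = list(product(STRESS_LEVELS, TEMPTATION_LEVELS))
    return [
        {"scenario": sid, "stress": stress, "temptation": temptation}
        for sid in scenario_ids
        for stress, temptation in cells
    ]


cases = build_test_cases(range(NUM_SCENARIOS))
high_pressure = [c for c in cases if c["stress"] == "high" and c["temptation"] == "high"]
```

Under these assumptions the matrix yields four variants per scenario, so `cases` holds 70 × 4 = 280 entries and `high_pressure` holds one per scenario.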