Testing Language Model Agents Safely in the Wild
Silen Naihin, David Atkinson, Marc Green, Merwane Hamadi, Craig Swift, Douglas Schonholtz, Adam Tauman Kalai, David Bau
TL;DR
The paper tackles the challenge of safely testing highly capable language-model agents in open-world settings by introducing a context-sensitive safety monitor (AgentMonitor) that can halt unsafe actions and log suspect behavior. It constructs a 29-task benchmark with 1,965 agent outputs and a 57-entry unsafe/off-task subset to evaluate monitoring performance, and shows that a GPT-based monitor can reach an F1 score of 89.4% (ROC AUC 0.982) with tuned parameters. Key contributions include a threat-model framework aligned with the CIA principles (confidentiality, integrity, availability), a scalable monitor design with tunable prompts, and empirical insights from ablation studies on safety boundaries. The work lays groundwork for safer real-world evaluations of autonomous agents and points to several next steps: expanding the dataset, drawing finer distinctions between unsafe and off-task behaviors, and adopting cautious deployment practices to mitigate risk.
Abstract
A prerequisite for safe autonomy-in-the-wild is safe testing-in-the-wild. Yet real-world autonomous tests face several unique safety challenges, arising both from the possibility of causing harm during a test and from the risk of encountering new unsafe agent behavior through interactions with real-world and potentially malicious actors. We propose a framework for conducting safe autonomous agent tests on the open internet: agent actions are audited by a context-sensitive monitor that enforces a stringent safety boundary to stop an unsafe test, with suspect behavior ranked and logged to be examined by humans. We design a basic safety monitor (AgentMonitor) that is flexible enough to monitor existing LLM agents, and, using an adversarial simulated agent, we measure its ability to identify and stop unsafe situations. Then we apply the AgentMonitor on a battery of real-world tests of AutoGPT, and we identify several limitations and challenges that will face the creation of safe in-the-wild tests as autonomous agents grow more capable.
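The audit loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `run_monitored_test`, `MonitorLog`, and `toy_scorer`, the 0-to-1 safety score, and the 0.5 threshold are all assumptions made for illustration, and the keyword-based scorer is only a stand-in for whatever context-sensitive model a real monitor would call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class MonitorLog:
    """Hypothetical log of (safety_score, action) pairs for human review."""
    entries: List[Tuple[float, str]] = field(default_factory=list)

    def record(self, score: float, action: str) -> None:
        self.entries.append((score, action))

    def ranked(self) -> List[Tuple[float, str]]:
        # Most suspect first: lowest safety score at the top.
        return sorted(self.entries, key=lambda entry: entry[0])


def run_monitored_test(
    actions: Sequence[str],
    score_action: Callable[[str, List[str]], float],
    threshold: float = 0.5,  # assumed safety boundary, not a value from the paper
) -> Tuple[List[str], MonitorLog, bool]:
    """Audit each proposed action; halt the test at the first one below threshold."""
    log = MonitorLog()
    executed: List[str] = []
    for action in actions:
        # The scorer sees the prior actions as context.
        score = score_action(action, executed)
        log.record(score, action)
        if score < threshold:
            return executed, log, True  # test stopped before the action runs
        executed.append(action)
    return executed, log, False


def toy_scorer(action: str, context: List[str]) -> float:
    """Keyword stand-in for a model-based scorer (illustrative only)."""
    risky = ("rm -rf", "password", "send_funds")
    return 0.1 if any(keyword in action for keyword in risky) else 0.9


proposed = [
    "read_file notes.txt",
    "browse example.com",
    "execute rm -rf /",
    "write_file out.txt",
]
executed, log, halted = run_monitored_test(proposed, toy_scorer)
print(halted, executed)
print(log.ranked()[0])
```

In this sketch the unsafe action is never executed, the later action is never reached, and the ranked log puts the lowest-scoring action first for a human reviewer. Tightening or loosening `threshold` trades missed unsafe actions against unnecessary halts.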
