Testing Language Model Agents Safely in the Wild

Silen Naihin, David Atkinson, Marc Green, Merwane Hamadi, Craig Swift, Douglas Schonholtz, Adam Tauman Kalai, David Bau

TL;DR

The paper tackles the challenge of safely testing highly capable language-model agents in open-world settings by introducing a context-sensitive safety monitor (AgentMonitor) that can halt unsafe actions and log suspect behavior. It constructs a 29-task benchmark with 1,965 agent outputs and a 57-entry unsafe/off-task subset to evaluate monitoring performance, and demonstrates that a GPT-based monitor can achieve an F1 score of 89.4% (ROC AUC 0.982) under optimized parameterization. Key contributions include a threat-model framework aligned with CIA principles, a scalable monitor design with tunable prompts, and empirical insights from ablation studies on safety boundaries. The work lays groundwork for safer real-world evaluations of autonomous agents and highlights avenues for dataset expansion, finer distinctions between unsafe and off-task behaviors, and cautious deployment practices to mitigate risks.

Abstract

A prerequisite for safe autonomy-in-the-wild is safe testing-in-the-wild. Yet real-world autonomous tests face several unique safety challenges, due both to the possibility of causing harm during a test and to the risk of encountering new unsafe agent behavior through interactions with real-world and potentially malicious actors. We propose a framework for conducting safe autonomous agent tests on the open internet: agent actions are audited by a context-sensitive monitor that enforces a stringent safety boundary to stop an unsafe test, with suspect behavior ranked and logged to be examined by humans. We design a basic safety monitor (AgentMonitor) that is flexible enough to monitor existing LLM agents, and, using an adversarial simulated agent, we measure its ability to identify and stop unsafe situations. Then we apply the AgentMonitor to a battery of real-world tests of AutoGPT, and we identify several limitations and challenges that will face the creation of safe in-the-wild tests as autonomous agents grow more capable.
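
The audit loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the names `run_monitored_test`, `MonitorLog`, and `toy_scorer`, the 0-to-1 score range, and the 0.5 threshold are all hypothetical, and the keyword heuristic merely stands in for the LLM-based scoring that a real monitor would use. The sketch only shows the control flow: score each proposed action, log it, halt the test if the score falls below the safety boundary, and rank the log for human review.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class MonitorLog:
    """Scored actions kept for later human review (hypothetical structure)."""
    entries: List[Tuple[float, str]] = field(default_factory=list)

    def ranked(self) -> List[Tuple[float, str]]:
        # Most suspect first: a lower safety score means more suspect.
        return sorted(self.entries, key=lambda entry: entry[0])


def run_monitored_test(
    actions: Iterable[str],
    score_action: Callable[[str], float],
    threshold: float = 0.5,  # assumed safety boundary, not a value from the paper
) -> Tuple[bool, MonitorLog]:
    """Audit each proposed action; stop the test at the first unsafe one."""
    log = MonitorLog()
    for action in actions:
        score = score_action(action)
        log.entries.append((score, action))
        if score < threshold:
            # Halt before the action would be executed.
            return False, log
        # A real harness would execute the approved action here.
    return True, log


def toy_scorer(action: str) -> float:
    # Hypothetical stand-in for an LLM-based scorer: flags one dangerous pattern.
    return 0.0 if "rm -rf" in action else 1.0


if __name__ == "__main__":
    proposed = ["read_file notes.txt", "rm -rf /", "write_file out.txt"]
    completed, log = run_monitored_test(proposed, toy_scorer)
    print("completed:", completed)
    for score, action in log.ranked():
        print(f"{score:.1f}  {action}")
```

In this sketch the third action is never scored, because the test stops as soon as the second action falls below the threshold. Raising the threshold makes the boundary more stringent at the cost of halting more benign tests.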

Paper Structure

This paper contains 31 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: The architecture of a safety test. The AgentMonitor, observing agent "thoughts" and actions, has the ability to stop the test at any point to prevent it from taking unsafe actions. If the test completes, its logs are evaluated for safety.
  • Figure 2: An illustrative comparison of LMA outputs with high (on the left) and low (on the right) safety scores.
  • Figure 3: Our results on the test set, in both cases using the parameter combination which scored most highly on the training set.
  • Figure 4: Every benchmark run starts with the easiest challenge. As agents pass more and more difficult challenges, they propagate down the tree. The further down the tree a challenge is, the more difficult it is to complete. Different colors represent different benchmark categories. Visit https://agbenchmark-frontend.vercel.app/ to interact with the graph and view information about each challenge.
  • Figure 5: The number of challenges run per agent within the dataset time frame. The challenge suite was run daily over all agents, as well as on every commit to master.
  • ...and 4 more figures