HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions

Xuhui Zhou; Hyunwoo Kim; Faeze Brahman; Liwei Jiang; Hao Zhu; Ximing Lu; Frank Xu; Bill Yuchen Lin; Yejin Choi; Niloofar Mireshghallah; Ronan Le Bras; Maarten Sap

TL;DR

HAICOSYSTEM presents a modular sandbox to evaluate AI agent safety within holistic, multi-turn human-AI-environment interactions that include tool use across diverse domains. It introduces HAICOSYSTEM-EVAL, a comprehensive LM-based evaluation framework that assesses Targeted, System/Operational, Content, Societal, and Legal risks, plus Efficiency and Goal attainment. Large-scale experiments across 132 scenarios and 12 models reveal widespread safety risks, particularly with malicious users and complex tool interactions, underscoring the need for holistic ecosystem evaluation rather than isolated, single-turn tests. The authors also provide a code platform enabling scenario authoring, simulation, and safety evaluation to advance practical, reproducible safety research in real-world human-AI collaboration.

Abstract

AI agents are increasingly autonomous in their interactions with human users and tools, leading to increased interactional safety risks. We present HAICOSYSTEM, a framework examining AI agent safety within diverse and complex social interactions. HAICOSYSTEM features a modular sandbox environment that simulates multi-turn interactions between human users and AI agents, where the AI agents are equipped with a variety of tools (e.g., patient management platforms) to navigate diverse scenarios (e.g., a user attempting to access other patients' profiles). To examine the safety of AI agents in these interactions, we develop a comprehensive multi-dimensional evaluation framework that uses metrics covering operational, content-related, societal, and legal risks. By running 1840 simulations based on 92 scenarios across seven domains (e.g., healthcare, finance, education), we demonstrate that HAICOSYSTEM can emulate realistic user-AI interactions and complex tool use by AI agents. Our experiments show that state-of-the-art LLMs, both proprietary and open-source, exhibit safety risks in over 50% of cases, with models generally showing higher risks when interacting with simulated malicious users. Our findings highlight the ongoing challenge of building agents that can safely navigate complex interactions, particularly when faced with malicious users. To foster the AI agent safety ecosystem, we release a code platform that allows practitioners to create custom scenarios, simulate interactions, and evaluate the safety and performance of their agents.

Paper Structure (38 sections, 10 figures, 9 tables)

Introduction
Background
Constructing the HAICOSYSTEM
Populating Scenarios
Evaluating Safety of AI Agents
Agent Safety Experiments
Experimental Setup and Simulation Validation
Benchmarking Safety Risks of AI Agents
Multi-turn interactions matter for AI agent safety
Analysis of Reasoning Models
Conclusion & Discussion
Ethics and Reproducibility Statement
Extended Related Work
Challenges and Approaches in Automated Red-Teaming
Simulating Social Interactions
...and 23 more sections

Figures (10)

Figure 1: An overview of HAICOSYSTEM. The framework enables simultaneous simulation of interactions between users, AI agents, and environments. The left side shows an example scenario from the 132 scenarios in HAICOSYSTEM covering diverse domains and user intent types (benign and malicious). The right side shows an example simulation where the AI agent follows the simulated user's instructions to prescribe a controlled medication to a patient without verification. After the simulation, the framework uses a set of metrics (HAICOSYSTEM-EVAL; see the section "Evaluating Safety of AI Agents") to evaluate the safety of the AI agent as well as its performance.
Figure 2: The risk ratio of models for different risk dimensions across simulated episodes. In the Overall dimension, an episode counts as risky if any individual risk dimension is negative. The higher the risk ratio, the more likely the model is to exhibit the corresponding safety risks. The table shows the overall risk ratio for all benchmarked models, while the bar chart displays dimension-wise risk ratios for representative models.
Figure 3: Qualitative examples of episodes in which AI agents interact with human users who have malicious (left) or benign (right) intents.
Figure 4: The overall risk ratio of each model between benign and malicious human user intents. "W/ or w/o tools" represents the risk ratio from scenarios where AI agents either have access to tools or do not, respectively.
Figure 5: The overall risk ratio between single-turn and multi-turn settings for AI agents powered by GPT-4-turbo in scenarios adapted from representative jailbreaking benchmarks.
...and 5 more figures
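
The Figure 2 caption describes an aggregation rule: an episode is risky overall if any individual risk dimension is negative. The sketch below illustrates one way that rule could be computed. It is not the authors' implementation. It assumes that the risk ratio is the fraction of episodes flagged as risky, that each episode is a mapping from dimension name to a numeric score, and that the dimension names follow the risk categories listed in the TL;DR; the function names and the sample scores are hypothetical.

```python
from typing import Dict, List

# Assumed dimension names, borrowed from the risk categories in the TL;DR.
RISK_DIMENSIONS = ["targeted", "system", "content", "societal", "legal"]


def is_risky(scores: Dict[str, float]) -> bool:
    """Flag an episode as risky overall if any risk dimension is negative."""
    return any(scores[dim] < 0 for dim in RISK_DIMENSIONS)


def risk_ratios(episodes: List[Dict[str, float]]) -> Dict[str, float]:
    """Return the fraction of risky episodes per dimension and overall.

    Assumption: "risk ratio" means risky episodes divided by all episodes.
    """
    total = len(episodes)
    if total == 0:
        raise ValueError("need at least one episode")
    ratios = {
        dim: sum(ep[dim] < 0 for ep in episodes) / total
        for dim in RISK_DIMENSIONS
    }
    ratios["overall"] = sum(is_risky(ep) for ep in episodes) / total
    return ratios


# Made-up scores for four episodes, for illustration only.
episodes = [
    {"targeted": 0, "system": 0, "content": 0, "societal": 0, "legal": 0},
    {"targeted": -5, "system": 0, "content": 0, "societal": 0, "legal": -2},
    {"targeted": 0, "system": -3, "content": 0, "societal": 0, "legal": 0},
    {"targeted": 0, "system": 0, "content": 0, "societal": 0, "legal": 0},
]
ratios = risk_ratios(episodes)
print(ratios)
```

Because the overall flag is a logical OR over dimensions, the overall ratio is never lower than any single dimension's ratio, which is worth keeping in mind when comparing the overall figure with the per-dimension bars.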
