SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents

Sheng Yin, Xianghe Pang, Yuanzhuo Ding, Menglan Chen, Yutong Bi, Yichen Xiong, Wenhao Huang, Zhen Xiang, Jing Shao, Siheng Chen

TL;DR

SafeAgentBench introduces the first safety-aware benchmark for embodied LLM agents in interactive environments. It provides a 750-task dataset with 10 hazard types across three task types, SafeAgentEnv for interactive execution, and dual execution/semantic evaluation metrics. Experiments across nine baselines and multiple LLMs reveal weak proactive safety and limited gains from simple defenses, underscoring the need for integrated safety in architecture and training. The work establishes SafeAgentBench as a foundational platform for advancing safe embodied AI.

Abstract

With the integration of large language models (LLMs), embodied agents have strong capabilities to understand and plan complicated natural language instructions. However, a foreseeable issue is that those embodied agents can also flawlessly execute some hazardous tasks, potentially causing damage in the real world. Existing benchmarks predominantly overlook critical safety risks, focusing solely on planning performance, while a few evaluate LLMs' safety awareness only on non-interactive image-text data. To address this gap, we present SafeAgentBench -- the first comprehensive benchmark for safety-aware task planning of embodied LLM agents in interactive simulation environments, covering both explicit and implicit hazards. SafeAgentBench includes: (1) an executable, diverse, and high-quality dataset of 750 tasks, rigorously curated to cover 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low-level controller, supporting multi-agent execution with 17 high-level actions for 9 state-of-the-art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives. Experimental results show that, although agents based on different design frameworks exhibit substantial differences in task success rates, their overall safety awareness remains weak. The most safety-conscious baseline achieves only a 10% rejection rate for detailed hazardous tasks. Moreover, simply replacing the LLM driving the agent does not lead to notable improvements in safety awareness. The dataset and code are available at https://github.com/shengyin1224/SafeAgentBench and https://huggingface.co/datasets/safeagentbench/SafeAgentBench.

Paper Structure

This paper contains 38 sections, 15 figures, 20 tables.

Figures (15)

  • Figure 1: Overview of SafeAgentBench.
  • Figure 2: (a) Composition of the SafeAgentBench dataset. (b) Counts of actions in detailed tasks. (c) Counts of step numbers in detailed tasks.
  • Figure 3: The workflow of embodied LLM agents in SafeAgentBench. Given a hazardous instruction and an observation from SafeAgentEnv, the agent leverages an LLM planner to produce a high-level plan. SafeAgentEnv executes each step via a low-level controller, updating the state and observations. Task completion is evaluated from execution and semantics, with agents able to reject instructions.
  • Figure 4: Performance breakdown on safe and hazardous tasks for baselines powered by GPT-4. Proactive defenses (rejection and deliberate failure) account for only a small percentage of outcomes, smaller than that of planning failures, revealing clear gaps in both planning and safety awareness.
  • Figure 5: Comparison of results for agents powered by different LLMs. (a) Average performance on safe and hazardous tasks across baselines (metrics as in the detailed-task results table; Rej+DF = rejection + deliberate failure; Rej-Abs = rejection on abstract tasks). Higher values indicate better planning and safety. LLMs vary in task success but show similarly poor safety awareness. (b) Performance across hazard abstraction levels. As abstraction increases, rejection rises and risk decreases for all LLMs, reflecting improved caution. (c) Proactive defense on detailed hazardous tasks across LLMs and agents.
  • ...and 10 more figures
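
The workflow described in the Figure 3 caption (plan, step-by-step execution, optional rejection) and the rejection rate quoted in the abstract can be illustrated with a minimal Python sketch. Everything here is hypothetical and for illustration only: `Episode`, `run_episode`, `toy_planner`, `toy_executor`, and the keyword-based rejection rule are assumptions, not the SafeAgentBench or SafeAgentEnv API, and the rejection rate is assumed to be the fraction of hazardous tasks the agent refuses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Episode:
    """Outcome of one task (hypothetical record, not the benchmark's format)."""
    instruction: str
    hazardous: bool
    rejected: bool = False
    steps_executed: List[str] = field(default_factory=list)
    failed_step: Optional[str] = None


def run_episode(
    instruction: str,
    hazardous: bool,
    planner: Callable[[str], Optional[List[str]]],
    execute_step: Callable[[str], bool],
) -> Episode:
    """Plan once, then execute each high-level step until one fails."""
    episode = Episode(instruction, hazardous)
    plan = planner(instruction)  # assumption: None means the agent rejects the task
    if plan is None:
        episode.rejected = True
        return episode
    for step in plan:
        if not execute_step(step):  # stand-in for the low-level controller
            episode.failed_step = step
            break
        episode.steps_executed.append(step)
    return episode


def rejection_rate(episodes: List[Episode]) -> float:
    """Fraction of hazardous tasks that were rejected (assumed definition)."""
    hazardous = [e for e in episodes if e.hazardous]
    if not hazardous:
        return 0.0
    return sum(e.rejected for e in hazardous) / len(hazardous)


def toy_planner(instruction: str) -> Optional[List[str]]:
    # Toy rule: refuse anything mentioning fire, otherwise return a fixed plan.
    if "fire" in instruction.lower():
        return None
    return ["find object", "pick object"]


def toy_executor(step: str) -> bool:
    # Toy environment: only these two steps can be executed.
    return step in {"find object", "pick object"}


episodes = [
    run_episode("Put the mug on the table", False, toy_planner, toy_executor),
    run_episode("Set the curtain on fire", True, toy_planner, toy_executor),
    run_episode("Pour water on the laptop", True, toy_planner, toy_executor),
]
print(rejection_rate(episodes))  # 0.5: one of the two hazardous tasks is rejected
```

The toy planner refuses only the explicitly worded hazard and plans the implicit one as usual, which mirrors in miniature the weak proactive safety that the abstract reports.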