EARBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents

Zihao Zhu; Bingzhe Wu; Zhengyou Zhang; Lei Han; Qingshan Liu; Baoyuan Wu

TL;DR

This work tackles the safety challenge of deploying foundation-model-based embodied AI (EAI) agents in the physical world by introducing EARBench, an automated pre-deployment risk assessment framework. EARBench uses a multi-agent pipeline to generate safety guidelines, design risky scenes, plan tasks, and assess plan safety, yielding the EARDataset with 2,636 textual and visual test cases across seven domains. Large-scale evaluation reveals pervasive physical risk in AI-generated plans, with an average Task Risk Rate around 95% across both open- and closed-source models, and with larger models not reliably improving safety. The authors propose two prompt-based risk mitigation strategies (implicit and explicit), finding that explicit prompting generally yields stronger reductions in TRR but leaves substantial safety gaps, underlining the need for safety-aligned pre-training and architectural solutions. Together, EARBench and EARDataset provide a standardized toolkit and dataset to drive future improvements in the safety and reliability of embodied AI systems for real-world use.
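The difference between the two prompt-based strategies can be illustrated with a minimal sketch. It assumes the implicit strategy appends a generic safety reminder and the explicit strategy appends the scenario's specific safety tip. The function name `build_prompt`, the prompt layout, and the reminder wording are all hypothetical and are not taken from the paper's released code.

```python
# Hypothetical wording of a generic reminder; the paper's actual prompt text may differ.
IMPLICIT_REMINDER = (
    "Before planning, consider potential physical safety risks in the scene "
    "and avoid them."
)


def build_prompt(observation, instruction, strategy="none", safety_tip=None):
    """Assemble a task-planning prompt under one of three assumed strategies.

    strategy: "none" (baseline), "implicit" (generic reminder),
              or "explicit" (scenario-specific safety tip).
    """
    parts = [f"Scene: {observation}", f"Task: {instruction}"]
    if strategy == "implicit":
        parts.append(IMPLICIT_REMINDER)
    elif strategy == "explicit":
        if not safety_tip:
            raise ValueError("explicit strategy requires a safety tip")
        parts.append(f"Safety tip: {safety_tip}")
    elif strategy != "none":
        raise ValueError(f"unknown strategy: {strategy}")
    parts.append("Produce a step-by-step high-level plan.")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        "A kitchen with a microwave and a metal bowl of soup.",
        "Heat the soup.",
        strategy="explicit",
        safety_tip="Do not put metal containers in a microwave.",
    ))
```

Keeping the baseline, implicit, and explicit variants behind one function makes it straightforward to run the same test case under all three conditions and compare the resulting risk rates.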

Abstract

Embodied artificial intelligence (EAI) integrates advanced AI models into physical entities for real-world interaction. The emergence of foundation models as the "brain" of EAI agents for high-level task planning has shown promising results. However, the deployment of these agents in physical environments presents significant safety challenges. For instance, a housekeeping robot lacking sufficient risk awareness might place a metal container in a microwave, potentially causing a fire. To address these critical safety concerns, comprehensive pre-deployment risk assessments are imperative. This study introduces EARBench, a novel framework for automated physical risk assessment in EAI scenarios. EARBench employs a multi-agent cooperative system that leverages various foundation models to generate safety guidelines, create risk-prone scenarios, perform task planning, and evaluate safety systematically. Utilizing this framework, we construct EARDataset, comprising diverse test cases across various domains, encompassing both textual and visual scenarios. Our comprehensive evaluation of state-of-the-art foundation models reveals alarming results: all models exhibit high task risk rates (TRR), with an average of 95.75% across all evaluated models. To address these challenges, we further propose two prompting-based risk mitigation strategies. While these strategies demonstrate some efficacy in reducing TRR, the improvements are limited, and substantial safety concerns remain. This study provides the first large-scale assessment of physical risk awareness in EAI agents. Our findings underscore the critical need for enhanced safety measures in EAI systems and provide valuable insights for future research directions in developing safer embodied artificial intelligence systems. Data and code are available at https://github.com/zihao-ai/EARBench.
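As a rough illustration of how the reported rates could be aggregated, the sketch below assumes that TRR is the percentage of test cases whose generated plan the evaluator judges risky, and that the Task Effectiveness Rate (TER) is the percentage judged effective. This page does not state the formulas, so these definitions, the `PlanVerdict` record, and the function names are assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class PlanVerdict:
    """One evaluator verdict for one generated plan (hypothetical record)."""
    domain: str
    risky: bool       # evaluator judged the plan to contain a physical risk
    effective: bool   # evaluator judged the plan to accomplish the task


def _percentage(flags):
    """Percentage of True values in a non-empty sequence of booleans."""
    flags = list(flags)
    if not flags:
        raise ValueError("no verdicts to aggregate")
    return 100.0 * sum(flags) / len(flags)


def task_risk_rate(verdicts):
    """Assumed TRR: share of plans judged risky, in percent."""
    return _percentage(v.risky for v in verdicts)


def task_effectiveness_rate(verdicts):
    """Assumed TER: share of plans judged effective, in percent."""
    return _percentage(v.effective for v in verdicts)


if __name__ == "__main__":
    verdicts = [
        PlanVerdict("home", risky=True, effective=True),
        PlanVerdict("home", risky=True, effective=False),
        PlanVerdict("medical", risky=False, effective=True),
        PlanVerdict("industrial", risky=True, effective=False),
    ]
    print(f"TRR = {task_risk_rate(verdicts):.2f}%")
    print(f"TER = {task_effectiveness_rate(verdicts):.2f}%")
```

Filtering the verdict list by `domain` before aggregating would give per-domain rates under the same assumed definition.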

Paper Structure (24 sections, 2 equations, 6 figures)

Introduction
Results
EARBench
EARDataset Construction
Evaluation of Foundation Models as "Brain" of EAI Agents
Overall Comparison Across Various Foundation Models and Domains
Comparison between Open-source and Closed-source Models
Comparison between Textual and Visual Scenarios
Comparison between Various Model Sizes
Comparison of Consistency between Automated Evaluation and Human Evaluation
Evaluation of Risk Mitigation Strategies
Case Study
Discussion
Methods
Safety Guidelines Generation
...and 9 more sections

Figures (6)

Figure 1: General overview of the study. a, System overview: given a scene name as input, safety guidelines are generated, followed by the generation of scene observations (textual and visual), which are then processed by the embodied artificial intelligence (EAI) agent to formulate high-level task plans. Finally, the system outputs the evaluation results. b, Detailed framework of EARBench, including four main modules: Safety Guidelines Generation, Risky Scene Generation, Embodied Task Planning, and Assessment. First, safety guidelines are generated based on the scene using a pre-trained LLM. Then, the risky scene generation module uses an LLM to generate a task instruction and detailed scene information specific to the safety tip, which are used to produce both textual and visual scene observations. The embodied task planning module then employs LLMs/VLMs to produce high-level plans. Finally, the safety and effectiveness of the plans are assessed by an LLM-based evaluator. c, The distribution of EARDataset: the collected test cases cover seven different domains, with the largest portions being home (38.3%) and commercial (23.0%). d, Evaluated foundation models, including open-source models like Llama-3.1, Mistral, Qwen, and DeepSeek, as well as closed-source models such as GPT-series, Claude 3, and Gemini 1.5 from companies like OpenAI, Anthropic, and Google.
Figure 2: Comparison of Task Risk Rate (TRR) and Task Effectiveness Rate (TER) across various foundation models. The dotted line separates open-source (left) from closed-source (right) models. Results show consistently high TRR (average 95.75%) across all models, indicating pervasive potential risks in AI-generated plans for EAI tasks. Notably, even GPT-4o, widely recognized as one of the most advanced language models, exhibits a high TRR of 94.03%. At the same time, TER remains relatively high (typically 80-95%), suggesting that models are adept at generating executable plans but struggle to incorporate safety considerations. This stark contrast between high task effectiveness and poor risk awareness highlights a critical gap in the current capabilities of foundation models for safe EAI applications.
Figure 3: Domain-specific analysis of Task Risk Rate (TRR) across various foundation models. We evaluate TRR across seven different domains (Home, Commercial, Medical, Science, Industrial, Education, and Entertainment) for all evaluated models. Results demonstrate persistently high TRR (>90%) across all domains, emphasizing the universal challenge of physical risk for diverse EAI tasks. For each domain, a representative visual observation is provided, along with a corresponding safety tip and an example of a risky plan that contains the potential risk.
Figure 4: Analysis of Task Risk Rate (TRR) across different model types, scenarios, and sizes, and the impact of safety tips on evaluation consistency. a, Comparison of TRR between open-source and closed-source models: open-source models show a higher average TRR of 96.38% with consistent performance, while closed-source models have a slightly lower average TRR of 94.86% but greater variability. b, Comparison between textual and visual scenarios for multimodal VLMs: visual inputs show marginally lower TRR, indicating a slight advantage of visual information in risk identification, though the benefit is limited. c, Relationship between model size and TRR for open-source models: TRR generally decreases as model size increases, e.g., Llama 3.1-8B at 96.93% vs Llama 3.1-70B at 95.51%, with exceptions such as DeepSeek-V2 at 96.7% despite being the largest model. d, Comparison of consistency between automated evaluation and human evaluation: using 200 randomly selected instances, each evaluated by five human annotators, we measured the agreement between automated and human evaluations using Cohen's Kappa. The Kappa value increases from 0.48 to 0.85 when safety tips are included in automated evaluation.
Figure 5: Evaluation of risk mitigation strategies for EAI task planning. a, Implicit risk mitigation strategy (RM-Implicit) uses general safety reminders in the prompt. b, Explicit risk mitigation strategy (RM-Explicit) provides specific safety tips in the prompt. c, Comparison of Task Risk Rates (TRR) across various models under different risk mitigation strategies. Both strategies reduce TRR, with the explicit strategy consistently outperforming the implicit one. Advanced closed-source models like GPT-4o and Claude 3 Haiku demonstrate larger TRR reductions with the explicit strategy, likely due to their enhanced understanding and reasoning capabilities. However, even the best-performing model (GPT-4o) maintains a TRR above 40% with explicit risk mitigation.
...and 1 more figures
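
The agreement measurement described in the Figure 4d caption can be sketched as follows. Cohen's Kappa is computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The caption does not say how the five human annotations per instance are combined, so the majority vote used here is an assumption, and the function names are illustrative.

```python
from collections import Counter


def majority_vote(votes):
    """Collapse several annotator labels into one (assumed aggregation rule)."""
    return Counter(votes).most_common(1)[0][0]


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa between two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((count_a[l] / n) * (count_b[l] / n)
              for l in set(labels_a) | set(labels_b))
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Toy data: 1 = plan judged risky, 0 = plan judged safe.
    human_votes = [[1, 1, 1, 0, 1], [1, 1, 0, 1, 1], [0, 0, 0, 1, 0], [0, 0, 1, 0, 0]]
    human = [majority_vote(v) for v in human_votes]
    automated = [1, 1, 0, 1]
    print(round(cohens_kappa(human, automated), 2))
```

With an odd number of annotators and binary labels a majority always exists, so no tie-breaking rule is needed in this setting.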
