A Framework for Benchmarking and Aligning Task-Planning Safety in LLM-Based Embodied Agents
Yuting Huang, Leilei Ding, Zhipeng Tang, Tianfu Wang, Xinrui Lin, Wuyang Zhang, Mingxiao Ma, Yanyong Zhang
TL;DR
The paper addresses safety hazards that arise during LLM-driven embodied task planning and presents Safe-BeAl, a dual framework comprising SafePlan-Bench (for comprehensive safety benchmarking) and Safe-Align (for aligning agents with physical-world safety knowledge). SafePlan-Bench formalizes embodied task-planning safety via Process and Termination constraints, builds the SafeRisks hazard dataset (2,027 samples across 8 hazard categories), and integrates a safety detector with VirtualHome to jointly evaluate safety and task success. Safe-Align introduces atomic-action level alignment using a weighted Bradley–Terry–inspired objective to emphasize error-prone steps while preserving planning performance, trained on a paired safe/unsafe action dataset. Across a variety of settings, Safe-BeAl improves safety by 8.55–15.22% relative to GPT-4-based embodied agents while maintaining task completion, demonstrating a practical pathway to safer real-world deployment of LLM-based embodied agents.
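As a rough illustration of what a weighted Bradley–Terry-style objective over atomic actions could look like, the sketch below scores each paired safe/unsafe action by the gap in model log-probability and weights the per-step loss. It is not the authors' implementation: the function name `weighted_bt_loss`, the `beta` scale, the per-step `weights`, and the normalization by the weight sum are all assumptions made for illustration.

```python
import math


def _softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bt_loss(safe_logps, unsafe_logps, weights, beta=1.0):
    """Weighted Bradley-Terry-style preference loss (illustrative sketch).

    safe_logps[i]   : model log-probability of the safe action at step i
    unsafe_logps[i] : model log-probability of the paired unsafe action
    weights[i]      : hypothetical per-step weight (larger = more error-prone)
    beta            : hypothetical scale on the log-probability margin
    """
    if not (len(safe_logps) == len(unsafe_logps) == len(weights)):
        raise ValueError("inputs must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    loss = 0.0
    for safe_lp, unsafe_lp, w in zip(safe_logps, unsafe_logps, weights):
        margin = beta * (safe_lp - unsafe_lp)
        # Bradley-Terry: P(safe preferred) = sigmoid(margin),
        # so the per-step loss is -log sigmoid(margin) = softplus(-margin).
        loss += w * _softplus(-margin)
    return loss / total_weight
```

Under these assumptions, raising the weight on a step where the model currently prefers the unsafe action raises the overall loss, so training pressure concentrates on the error-prone steps rather than being spread evenly across the plan.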
Abstract
Large Language Models (LLMs) exhibit substantial promise in enhancing task-planning capabilities within embodied agents due to their advanced reasoning and comprehension. However, the systemic safety of these agents remains an underexplored frontier. In this study, we present Safe-BeAl, an integrated framework for the measurement (SafePlan-Bench) and alignment (Safe-Align) of LLM-based embodied agents' behaviors. SafePlan-Bench establishes a comprehensive benchmark for evaluating task-planning safety, encompassing 2,027 daily tasks and corresponding environments distributed across 8 distinct hazard categories (e.g., Fire Hazard). Our empirical analysis reveals that even in the absence of adversarial inputs or malicious intent, LLM-based agents can exhibit unsafe behaviors. To mitigate these hazards, we propose Safe-Align, a method designed to integrate physical-world safety knowledge into LLM-based embodied agents while maintaining task-specific performance. Experiments across a variety of settings demonstrate that Safe-BeAl provides comprehensive safety validation, improving safety by 8.55–15.22% compared to embodied agents based on GPT-4 while ensuring successful task completion.
