IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks
Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, Jing Shao
TL;DR
IS-Bench introduces the first interactive, multi-modal benchmark for evaluating the safety of VLM-driven embodied agents in daily household tasks, addressing the gap left by static evaluation. It combines a process-oriented safety evaluation with a data-generation pipeline that injects dynamic risks and formal safety goals, enabling rigorous analysis of risk perception and mitigation during interaction. Experiments across 16 agents reveal a persistent safety-accuracy trade-off: safety-focused prompts improve risk handling but can hamper task completion, while explicit hazard knowledge yields strong safety recall but relies on prior risk perception. The work highlights perception and awareness as key bottlenecks and provides a foundation for developing safer embodied AI through improved multi-modal perception and planning strategies.
Abstract
Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems. Code and data are available at https://github.com/AI45Lab/IS-Bench.
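To make the idea of process-oriented evaluation concrete, the following is a minimal sketch of an ordering check over an agent's action trace. It is not the IS-Bench implementation: the names SafetyGoal, goal_satisfied, and safe_success, the string encoding of actions, and the choice to treat a goal as vacuously satisfied when its risk-prone step never occurs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass(frozen=True)
class SafetyGoal:
    """Hypothetical safety goal: a mitigation tied to a risk-prone step."""
    mitigation: str                      # action that mitigates the risk
    trigger: str                         # risk-prone step it is tied to
    order: Literal["before", "after"]    # when the mitigation must occur


def goal_satisfied(trace: List[str], goal: SafetyGoal) -> bool:
    """Check the ordering constraint against the executed action trace."""
    if goal.trigger not in trace:
        # Assumption: if the risk-prone step never ran, nothing needs mitigating.
        return True
    trigger_idx = trace.index(goal.trigger)  # first occurrence of the risky step
    if goal.order == "before":
        return goal.mitigation in trace[:trigger_idx]
    return goal.mitigation in trace[trigger_idx + 1:]


def safe_success(trace: List[str], goals: List[SafetyGoal], task_completed: bool) -> bool:
    """A run counts as safely successful only if the task is done and every goal holds."""
    return task_completed and all(goal_satisfied(trace, g) for g in goals)


# Illustrative trace and goals (invented for this sketch, not benchmark data).
trace = ["open(cabinet)", "turn_on(stove)", "cook(food)", "turn_off(stove)"]
post_goal = SafetyGoal("turn_off(stove)", "cook(food)", "after")
pre_goal = SafetyGoal("wear(gloves)", "cook(food)", "before")
```

Checking intermediate steps this way differs from a post-hoc check on the final state: a trace that ends in a safe configuration can still fail pre_goal because the mitigation did not precede the risk-prone step.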
