IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, Jing Shao

TL;DR

IS-Bench introduces the first interactive, multi-modal benchmark for evaluating safety of VLM-driven embodied agents in daily household tasks, addressing the gap left by static evaluation. It combines a process-oriented safety evaluation with a data-generation pipeline that injects dynamic risks and formal safety goals, enabling rigorous analysis of risk perception and mitigation during interaction. Experiments across 16 agents reveal a persistent safety-accuracy trade-off: safety-focused prompts improve risk handling but can hamper task completion, while explicit hazard knowledge yields strong safety recall but relies on prior risk perception. The work highlights perception and awareness as key bottlenecks and provides a foundation for developing safer embodied AI through improved multi-modal perception and planning strategies.

Abstract

Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems. Code and data are released under https://github.com/AI45Lab/IS-Bench.
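
The process-oriented check described in the abstract can be illustrated with a minimal sketch. This is not the IS-Bench implementation: the `SafetyGoal` type, the `goal_satisfied` and `safe_fraction` helpers, and the action strings are all hypothetical names chosen for illustration. The sketch assumes an executed plan is a list of action strings and that each safety goal ties one mitigation action to one risk-prone step with a "before" or "after" ordering.

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass(frozen=True)
class SafetyGoal:
    """Hypothetical ordering constraint: a mitigation tied to a risk-prone step."""
    mitigation: str                      # action that mitigates the risk
    risky_step: str                      # risk-prone action it is tied to
    order: Literal["before", "after"]    # when the mitigation must occur


def goal_satisfied(trace: List[str], goal: SafetyGoal) -> bool:
    """Check one ordering constraint against an executed action trace."""
    if goal.risky_step not in trace:
        # Assumption: the constraint is vacuous if the risky step never runs.
        return True
    risky_idx = trace.index(goal.risky_step)
    if goal.order == "before":
        return goal.mitigation in trace[:risky_idx]
    return goal.mitigation in trace[risky_idx + 1:]


def safe_fraction(trace: List[str], goals: List[SafetyGoal]) -> float:
    """Fraction of safety goals whose ordering constraint holds on the trace."""
    if not goals:
        return 1.0
    return sum(goal_satisfied(trace, g) for g in goals) / len(goals)


# Hypothetical trace: the stove is turned off after cooking, but the cloth
# is never moved away before the stove is turned on.
trace = ["pick(cloth)", "turn_on(stove)", "cook(food)", "turn_off(stove)"]
goals = [
    SafetyGoal("turn_off(stove)", "cook(food)", "after"),
    SafetyGoal("move_away(cloth)", "turn_on(stove)", "before"),
]
print(safe_fraction(trace, goals))
```

A check on the final state alone would see the stove off and could pass this trace; checking the order of intermediate steps is what exposes the missing pre-caution.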

Paper Structure

This paper contains 27 sections, 1 equation, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Evaluation of Embodied Agents' Interactive Safety. IS-Bench employs (a) interactive evaluation scenarios that can simulate dynamic risks during interaction and (b) process-oriented evaluation approaches that provide accurate analysis.
  • Figure 2: Evaluation Scenarios Generation in IS-Bench.
  • Figure 3: Evaluation Framework in IS-Bench. Given multi-modal contexts, we test VLM-driven embodied agents using execution- and LLM-based safety evaluation.
  • Figure 4: Ablation on Different Visual Inputs. In addition to the scene image, we also provide bounding boxes (BBox), self-generated captions (Caption), and the initial setup (IS).
  • Figure S1: Data Statistics of IS-Bench. (a) Distribution of primary safety hazard categories. The three least frequent categories (broken damage, slipping hazard, and sharp object hazard) are excluded. (b) Task complexity in IS-Bench, measured by the length of the annotated plan. (c) Frequency distribution of primitive skills required by the tasks.
  • ...and 2 more figures