HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

Jiayue Pu, Zhongxiang Sun, Zilu Zhang, Xiao Zhang, Jun Xu

Abstract

The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and a lack of commonsense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline that combines physical simulation with advanced video generation, and it features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.

Paper Structure (40 sections, 8 equations, 13 figures, 9 tables)

Figures (13)

  • Figure 1: HomeSafe-Bench construction pipeline. From LLM-generated hazard causes across six household locations, we create video descriptions, generate videos combining physical simulation with video generation models, and produce multi-dimensional annotations with quality checks.
  • Figure 2: HomeSafe-Bench Statistics. Left: Hierarchical taxonomy of household risks, spanning danger categories and severity levels. Right: Distribution analysis. Legend: Danger Categories (C1–C4), Severity (L1–L4), Reasoning Difficulty (D1–D3).
  • Figure 3: Temporal phase scoring and key frame definitions. The timeline is divided into five phases based on four annotated key frames. Detections in the Optimal window earn maximum scores (100) to reward early intervention, whereas delayed detections are strictly penalized.
  • Figure 4: Illustration of our proposed HD-Guard architecture that employs a hierarchical dual-brain system. The FastBrain continuously monitors video streams, classifying each frame's safety state into traffic-light categories (Green/Yellow/Red), dynamically adjusting sampling rates and triggering the SlowBrain for ambiguous Yellow states. The SlowBrain performs deep semantic reasoning with physical common sense through structured CoT analysis. These two modules collaborate asynchronously with a priority mechanism ensuring immediate FastBrain overrides for Red alerts while awaiting SlowBrain verdicts for complex scenarios.
  • Figure 5: Severity and Latency Evaluation. Left: Distribution of severity assessments across VLMs, with each prediction categorized as underestimation (dangerous), correct, or overestimation (conservative). Right: Efficiency-Quality Trade-off. HD-Guard achieves a superior balance between low latency and high safety scores. The gray dashed line shows the Pareto frontier.
  • ...and 8 more figures
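
The routing described in the Figure 4 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `Light`, `SAMPLE_EVERY`, `monitor`, `fast_brain`, and `slow_brain` are hypothetical, the sampling intervals are invented for illustration, and the assumption that sampling becomes denser as risk rises is ours. The SlowBrain is also called synchronously here for simplicity, whereas the caption describes asynchronous collaboration.

```python
from enum import Enum
from typing import Callable, List, Sequence, Tuple


class Light(Enum):
    """Traffic-light safety states named in the Figure 4 caption."""
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


# Hypothetical sampling intervals (in frames). The source gives no values;
# we assume sampling gets denser as the perceived risk rises.
SAMPLE_EVERY = {Light.GREEN: 4, Light.YELLOW: 2, Light.RED: 1}


def monitor(
    frames: Sequence[object],
    fast_brain: Callable[[object], Light],
    slow_brain: Callable[[object], bool],
) -> List[Tuple[int, str]]:
    """Return (frame_index, decision) for every sampled frame.

    fast_brain: cheap per-frame classifier returning a Light.
    slow_brain: expensive verifier returning True if the frame is unsafe.
    """
    events: List[Tuple[int, str]] = []
    i = 0
    while i < len(frames):
        state = fast_brain(frames[i])
        if state is Light.RED:
            # Red alerts take priority: act without consulting the SlowBrain.
            events.append((i, "alert:fast"))
        elif state is Light.YELLOW:
            # Ambiguous frames are escalated to the SlowBrain for a verdict.
            unsafe = slow_brain(frames[i])
            events.append((i, "alert:slow" if unsafe else "clear:slow"))
        else:
            events.append((i, "safe"))
        # Adjust the sampling rate according to the current state.
        i += SAMPLE_EVERY[state]
    return events
```

In a streaming deployment the SlowBrain call would presumably run as a background task so that the FastBrain keeps screening frames while a verdict is pending. The synchronous call above only shows the escalation and override order.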