Table of Contents
Fetching ...

BeSafe-Bench: Unveiling Behavioral Safety Risks of Situated Agents in Functional Environments

Yuxuan Li, Yi Lin, Peng Wang, Shiming Liu, Xuetao Wei

Abstract

The rapid evolution of Large Multimodal Models (LMMs) has enabled agents to perform complex digital and physical tasks, yet their deployment as autonomous decision-makers introduces substantial unintentional behavioral safety risks. However, the absence of a comprehensive safety benchmark remains a major bottleneck, as existing evaluations rely on low-fidelity environments, simulated APIs, or narrowly scoped tasks. To address this gap, we present BeSafe-Bench (BSB), a benchmark for exposing behavioral safety risks of situated agents in functional environments, covering four representative domains: Web, Mobile, Embodied VLM, and Embodied VLA. Using functional environments, we construct a diverse instruction space by augmenting tasks with nine categories of safety-critical risks, and adopt a hybrid evaluation framework that combines rule-based checks with LLM-as-a-judge reasoning to assess real environmental impacts. Evaluating 13 popular agents reveals a concerning trend: even the best-performing agent completes fewer than 40% of tasks while fully adhering to safety constraints, and strong task performance frequently coincides with severe safety violations. These findings underscore the urgent need for improved safety alignment before deploying agentic systems in real-world settings.

BeSafe-Bench: Unveiling Behavioral Safety Risks of Situated Agents in Functional Environments

Abstract

The rapid evolution of Large Multimodal Models (LMMs) has enabled agents to perform complex digital and physical tasks, yet their deployment as autonomous decision-makers introduces substantial unintentional behavioral safety risks. However, the absence of a comprehensive safety benchmark remains a major bottleneck, as existing evaluations rely on low-fidelity environments, simulated APIs, or narrowly scoped tasks. To address this gap, we present BeSafe-Bench (BSB), a benchmark for exposing behavioral safety risks of situated agents in functional environments, covering four representative domains: Web, Mobile, Embodied VLM, and Embodied VLA. Using functional environments, we construct a diverse instruction space by augmenting tasks with nine categories of safety-critical risks, and adopt a hybrid evaluation framework that combines rule-based checks with LLM-as-a-judge reasoning to assess real environmental impacts. Evaluating 13 popular agents reveals a concerning trend: even the best-performing agent completes fewer than 40% of tasks while fully adhering to safety constraints, and strong task performance frequently coincides with severe safety violations. These findings underscore the urgent need for improved safety alignment before deploying agentic systems in real-world settings.

Paper Structure

This paper contains 23 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: An overview of the BeSafe-Bench. BeSafe-Bench generates a safety-critical task dataset by integrating initial tasks with predefined safety risk types and factors. Through multi-round dynamic interactions between agents and functional environments, environment states, and agent trajectories are recorded. These data are subsequently processed via a hybrid evaluation framework to assess task completion rates and safety coverage.
  • Figure 2: Evaluation of task success and safety compliance across different web environments (a) and risk categories (b)
  • Figure 3: Task completion and safety rates under diverse risk conditions on LIBERO-90 and BSB-EmbodiedVLA. LIBERO-90 consists of basic tasks in risk-free settings and thus does not include a safety rate, while all other metrics are evaluated on BSB-EmbodiedVLA.