ResponsibleRobotBench: Benchmarking Responsible Robot Manipulation using Multi-modal Large Language Models

Lei Zhang, Ju Dong, Kaixin Bai, Minheng Ni, Zoltan-Csaba Marton, Zhaopeng Chen, Jianwei Zhang

TL;DR

ResponsibleRobotBench introduces a multi-task, multimodal benchmark to evaluate risk-aware, safe robotic manipulation driven by LLMs/LMMs. The framework combines hazard-aware task suites (electrical, fire/chemical, human hazards) with modular action representations, in-context learning, cognition-informed prompting, and a rigorous evaluation suite including safety, success, and cost metrics. Extensive experiments across hazard categories, action modalities, human-in-the-loop settings, and prompt strategies reveal strengths and limitations of current LMM-driven agents, particularly in spatial planning and long-horizon tasks, while highlighting the value of human oversight for safety. This work establishes a reproducible, extensible platform to drive the development of trustworthy, physically grounded robotic systems capable of safe operation in complex real-world environments.

Abstract

Recent advances in large multimodal models have enabled new opportunities in embodied AI, particularly in robotic manipulation. These models have shown strong potential in generalization and reasoning, but achieving reliable and responsible robotic behavior in real-world settings remains an open challenge. In high-stakes environments, robotic agents must go beyond basic task execution to perform risk-aware reasoning, moral decision-making, and physically grounded planning. We introduce ResponsibleRobotBench, a systematic benchmark designed to evaluate and accelerate progress in responsible robotic manipulation from simulation to the real world. The benchmark consists of 23 multi-stage tasks that span diverse risk types, including electrical, chemical, and human-related hazards, and that vary in physical and planning complexity. These tasks require agents to detect and mitigate risks, reason about safety, plan sequences of actions, and engage human assistance when necessary. The benchmark includes a general-purpose evaluation framework that supports multimodal model-based agents with various action representation modalities. The framework integrates visual perception, in-context learning, prompt construction, hazard detection, reasoning and planning, and physical execution. It also provides a rich multimodal dataset, supports reproducible experiments, and includes standardized metrics such as success rate, safety rate, and safe success rate. Through extensive experimental setups, ResponsibleRobotBench enables analysis across risk categories, task types, and agent configurations. By emphasizing physical reliability, generalization, and safety in decision-making, this benchmark provides a foundation for advancing the development of trustworthy, responsible, dexterous robotic systems for the real world. https://sites.google.com/view/responsible-robotbench
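The three metrics named in the abstract can be illustrated with a minimal sketch. The definitions used here are assumptions, since the abstract does not spell them out: success rate is taken as the fraction of episodes in which the task was completed, safety rate as the fraction with no safety violation, and safe success rate as the fraction that were both completed and violation-free. The `Episode` record and `summarize_metrics` function are hypothetical names, not part of the benchmark's code.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Episode:
    """Outcome of one evaluation episode (hypothetical record layout)."""
    task_completed: bool   # did the agent reach the task goal?
    safety_violated: bool  # did any hazardous event occur during execution?


def summarize_metrics(episodes: Sequence[Episode]) -> Dict[str, float]:
    """Compute success, safety, and safe success rates under the assumed definitions."""
    n = len(episodes)
    if n == 0:
        raise ValueError("at least one episode is required")
    successes = sum(e.task_completed for e in episodes)
    safe = sum(not e.safety_violated for e in episodes)
    # An episode counts as a safe success only if it is both completed and safe.
    safe_successes = sum(e.task_completed and not e.safety_violated for e in episodes)
    return {
        "success_rate": successes / n,
        "safety_rate": safe / n,
        "safe_success_rate": safe_successes / n,
    }


if __name__ == "__main__":
    demo = [
        Episode(task_completed=True, safety_violated=False),
        Episode(task_completed=True, safety_violated=True),
        Episode(task_completed=False, safety_violated=False),
        Episode(task_completed=False, safety_violated=True),
    ]
    print(summarize_metrics(demo))
```

Under these assumed definitions, the safe success rate can never exceed either of the other two rates, which is why it is the strictest of the three when comparing agents.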

Paper Structure

This paper contains 61 sections, 1 equation, 8 figures, 16 tables.

Figures (8)

  • Figure 1: ResponsibleRobotBench is a comprehensive evaluation framework for assessing the reliability, safety, and risk awareness of robotic manipulation systems powered by large language models (LLMs) and vision-language models (VLMs). The benchmark supports diverse action representation modalities—including predefined skills, manipulation poses, and code generation—and categorizes tasks across multiple axes such as hazard type, planning difficulty, and instruction intent (e.g., normal, attack, or defense). Fine-grained evaluation metrics are used to assess an agent’s understanding of safety constraints and operational effectiveness in hazardous or ambiguous scenarios. This modular design enables standardized, scalable, and interpretable comparisons across a wide spectrum of embodied AI agents.
  • Figure 2: Our benchmark considers manipulation under varying scene complexities and motion planning difficulties for the same task. The workflow of the flower-watering task under different parameters is illustrated. Based on scene complexity, we design environments both with and without hazards. The robot adopts different grasping strategies depending on the planning complexity of the task. Planning complexity is categorized into simple and difficult. In the simple flower-watering scenario, the robot grasps the top handle of the watering can, which only requires positional offset planning of the end-effector. In contrast, the difficult scenario requires the robot to grasp the side handle, necessitating 6D pose planning of the end-effector.
  • Figure 3: Tasks are categorized based on whether they can be completed safely. Safety-executable tasks are designed to assess the agent’s ability to operate responsibly, while safety-violating tasks are used to investigate the agent's behavior in response to both offensive and defensive prompts.
  • Figure 4: Natural language commands are first input into the agent. These commands fall into three categories: general commands, attack-task commands, and defense-task commands. Through in-context learning, the robot can perform tasks without retraining the large language model, relying instead on in-context examples. These examples may include visual inputs and demonstrations of robotic actions expressed in various action representation formats. During prompt construction, prompts are configured with different parameter settings and output format specifications. Cognitive priors can also be incorporated into the prompt design to guide the agent toward responsible and context-aware behavior. The agent's output comprises visual descriptions, safety-aware and planning-related reasoning and reflection, hazard detection results, and the final task planning outcome. The predicted robotic actions are validated within a physical simulation environment. Throughout this process, observations from the environment are continuously collected to support subsequent action generation, performance evaluation, and detailed report analysis. The evaluation report assesses the agent's output format and structure, analyzes hazard detection accuracy, quantifies task success and safety rates, and identifies potential causes of planning failures.
  • Figure 5: Evaluation metric results and corresponding fine-grained error analysis. (a) Performance of tasks with potential hazards using different LMMs. (b) Performance of tasks without hazards. (c) Performance under different hazard categories using GPT-4o. (d) Example of fine-grained error analysis.
  • ...and 3 more figures
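
The evaluation axes described in the captions above (hazard type, planning difficulty, instruction intent, and action representation) can be sketched as a small configuration grid. This is an illustrative data structure only: the axis values are taken from the text, but the class names, the `EvalConfig` record, and the idea of enumerating the full cross product are assumptions. The benchmark's 23 tasks are not claimed to cover every combination.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product
from typing import List


class Hazard(Enum):
    ELECTRICAL = "electrical"
    FIRE_CHEMICAL = "fire/chemical"
    HUMAN = "human"


class Difficulty(Enum):
    SIMPLE = "simple"        # e.g. positional offset planning of the end-effector
    DIFFICULT = "difficult"  # e.g. 6D pose planning of the end-effector


class Intent(Enum):
    NORMAL = "normal"
    ATTACK = "attack"
    DEFENSE = "defense"


class ActionRepr(Enum):
    PREDEFINED_SKILLS = "predefined skills"
    MANIPULATION_POSES = "manipulation poses"
    CODE_GENERATION = "code generation"


@dataclass(frozen=True)
class EvalConfig:
    """One point in the hypothetical evaluation grid."""
    hazard: Hazard
    difficulty: Difficulty
    intent: Intent
    action_repr: ActionRepr


def enumerate_configs() -> List[EvalConfig]:
    """Return every combination of the four axes (an assumed, exhaustive grid)."""
    return [EvalConfig(*combo) for combo in product(Hazard, Difficulty, Intent, ActionRepr)]


if __name__ == "__main__":
    configs = enumerate_configs()
    print(len(configs), "configurations")
    print(configs[0])
```

Keeping each axis as its own enumeration makes it straightforward to group results by a single axis, for example reporting metrics per hazard category as in Figure 5(c).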