Towards Evaluating Proactive Risk Awareness of Multimodal Language Models

Youliang Yuan, Wenxiang Jiao, Yuejin Xie, Chihao Shen, Menghan Tian, Wenxuan Wang, Jen-tse Huang, Pinjia He

TL;DR

This paper introduces PaSBench, a proactive safety benchmark that tests whether multimodal language models can observe everyday environments, recognize potential risks, and issue timely alerts without being prompted by the user. The dataset comprises 416 samples (128 image-based and 288 text-based) across five domains (Home, Outdoor, Sports, Food, Disaster) and 288 knowledge points, generated via a human-in-the-loop pipeline using GPT-4o and DeepSeek-R1 to generate the image and log data. Across 36 models, proactive risk detection is only modest: top performers reach about 71% accuracy on images and 64% on text, and robustness is often below 0.55, indicating that proactive reminders would be unreliable in real-world use. Analyses indicate that the main bottleneck is unstable proactive reasoning and recall of safety knowledge rather than basic understanding of text or images, which points to future work on training with proactive data, online reinforcement learning, and propose-then-verify pipelines to stabilize alerts. The dataset and findings aim to spur development of safer AI assistants that actively prevent harm, not just respond to user queries.

Abstract

Human safety awareness gaps often prevent the timely recognition of everyday risks. To address this problem, a proactive safety artificial intelligence (AI) system would work better than a reactive one. Instead of just reacting to users' questions, it would actively watch people's behavior and their environment to detect potential dangers in advance. Our Proactive Safety Bench (PaSBench) evaluates this capability through 416 multimodal scenarios (128 image sequences, 288 text logs) spanning 5 safety-critical domains. Evaluation of 36 advanced models reveals fundamental limitations: top performers like Gemini-2.5-pro achieve 71% image and 64% text accuracy, but miss 45-55% of risks in repeated trials. Through failure analysis, we identify unstable proactive reasoning, rather than knowledge deficits, as the primary limitation. This work establishes (1) a proactive safety benchmark, (2) systematic evidence of model limitations, and (3) critical directions for developing reliable protective AI. We believe our dataset and findings can promote the development of safer AI assistants that actively prevent harm rather than merely respond to requests. Our dataset can be found at https://huggingface.co/datasets/Youliang/PaSBench.
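
The abstract contrasts per-trial accuracy with reliability across repeated trials. The following is a minimal sketch of that distinction, not the paper's scoring code: the function names, the sample IDs, the toy outcomes, and the "flagged in every trial" criterion are all illustrative assumptions, and the benchmark's actual robustness metric may be defined differently.

```python
from typing import Dict, List

# Hypothetical per-sample outcomes: True means the model flagged the risk
# in that trial. Sample IDs and values are made up for illustration.
Trials = Dict[str, List[bool]]


def detection_rate(trials: Trials) -> float:
    """Average, over samples, of the fraction of trials that flagged the risk."""
    per_sample = [sum(outcomes) / len(outcomes) for outcomes in trials.values()]
    return sum(per_sample) / len(per_sample)


def all_trials_rate(trials: Trials) -> float:
    """Fraction of samples whose risk was flagged in every repeated trial
    (an assumed consistency criterion, not the paper's definition)."""
    consistent = sum(all(outcomes) for outcomes in trials.values())
    return consistent / len(trials)


results: Trials = {
    "home_001": [True, True, True, True],
    "food_002": [True, False, True, False],
    "sports_003": [False, False, False, False],
    "outdoor_004": [True, True, True, False],
}

print(detection_rate(results))   # 0.5625
print(all_trials_rate(results))  # 0.25
```

In this toy data the average detection rate is above one half, yet only one sample in four is flagged every time, which is the kind of gap between accuracy and repeated-trial reliability that the abstract describes.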

Paper Structure

This paper contains 44 sections, 1 equation, 12 figures, 3 tables, and 1 algorithm.

Figures (12)

  • Figure 1: Illustrative examples from our PaSBench and existing human safety datasets: SafeText [levy2022safetext], RESPONSE [diallo2025response], HealthBench [healthbench], MSSBench [zhou2025multimodal], and LabsafetyBench [zhou2024labsafety].
  • Figure 2: Pipeline for dataset construction.
  • Figure 3: Risk detection rates of multi-modal language models on the image set.
  • Figure 4: Risk detection rates of language models on the text set.
  • Figure 5: Accuracies on the image set, a subset of the text set, and the multiple-choice question answering (QA) set. All three sets cover the same 128 knowledge points.
  • ...and 7 more figures