Multimodal Situational Safety

Kaiwen Zhou; Chengzhi Liu; Xuandong Zhao; Anderson Compalas; Dawn Song; Xin Eric Wang

TL;DR

The paper defines Multimodal Situational Safety and introduces MSSBench, a benchmark for evaluating whether an MLLM can judge the safety of a language query given real-time visual context across chat and embodied tasks. It reveals that current open-source and proprietary MLLMs struggle with subtler safety cues, especially in embodied scenarios, and identifies explicit safety reasoning and visual grounding as key bottlenecks. To address these gaps, the authors propose a multi-agent safety framework that decomposes tasks into specialized subtasks (intent reasoning, visual understanding, safety judgment, QA) and demonstrate systematic safety gains across models. While the approach reduces safety errors, the results also highlight persistent challenges and suggest future directions in safety-alignment and enhanced multimodal prompting to further improve robustness in real-world use.
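The decomposition described above can be pictured as a chain of specialized agents. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Sample` type, the `run_pipeline` function, the keyword-based `toy_*` agents, and the running-near-a-cliff example are all hypothetical, and the image is replaced by a text caption so the code runs without a model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any callable mapping a text prompt to a text reply.
Agent = Callable[[str], str]


@dataclass
class Sample:
    query: str          # the user's language query or instruction
    image_caption: str  # assumption: a caption stands in for the real image


def run_pipeline(sample: Sample, intent_agent: Agent, visual_agent: Agent,
                 safety_agent: Agent, qa_agent: Agent) -> str:
    """Chain the four subtasks: intent, visual understanding, safety judgment, QA."""
    intent = intent_agent(sample.query)
    scene = visual_agent(sample.image_caption)
    verdict = safety_agent(f"intent: {intent}\nscene: {scene}")
    # Only answer the query when the safety agent does not flag the situation.
    if verdict.strip().lower().startswith("unsafe"):
        return f"Warning: {verdict}"
    return qa_agent(sample.query)


# Toy stand-ins for MLLM calls (illustrative keyword rules only).
def toy_intent(query: str) -> str:
    return query.lower()


def toy_visual(caption: str) -> str:
    return caption.lower()


def toy_safety(context: str) -> str:
    risky = "run" in context and "cliff" in context
    return "unsafe: running near a cliff edge is risky" if risky else "safe"


def toy_qa(query: str) -> str:
    return f"Answer to: {query}"


if __name__ == "__main__":
    for caption in ("A narrow path on a cliff edge", "A running track in a stadium"):
        sample = Sample("How can I run faster?", caption)
        print(run_pipeline(sample, toy_intent, toy_visual, toy_safety, toy_qa))
```

Keeping each agent a plain callable means the same query can be paired with a safe and an unsafe context and routed through whichever models are being compared.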

Abstract

Multimodal Large Language Models (MLLMs) are rapidly evolving, demonstrating impressive capabilities as multimodal assistants that interact with both humans and their environments. However, this increased sophistication introduces significant safety concerns. In this paper, we present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety, which explores how safety considerations vary based on the specific situation in which the user or agent is engaged. We argue that for an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context. To evaluate this capability, we develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs. The dataset comprises 1,820 language query-image pairs; in half of them the image context is safe, and in the other half it is unsafe. We also develop an evaluation framework that analyzes key safety aspects, including explicit safety reasoning, visual understanding, and, crucially, situational safety reasoning. Our findings reveal that current MLLMs struggle with this nuanced safety problem in the instruction-following setting and have difficulty tackling these situational safety challenges all at once, highlighting a key area for future research. Furthermore, we develop multi-agent pipelines that solve the safety challenges in a coordinated way and show consistent improvement in safety over the original MLLM response. Code and data: mssbench.github.io.

Paper Structure (44 sections, 35 figures, 8 tables)

Introduction
Related Work
MLLMs for Multimodal Assistants.
Multimodal Large Language Model Safety.
Multimodal Situational Safety
Dataset Overview
Problem Definition.
Dataset Description.
Multimodal Situational Safety Category.
Chat Data Collection
Generation of Intended Activity and Textual Unsafe Situations.
Automatic Filtering with LLM.
Construction of Multimodal Situational Safety Dataset through Image Retrieval.
Human Verification and Query Generation.
Embodied Data Collection
...and 29 more sections

Figures (35)

Figure 1: Illustration of multimodal situational safety. The model must judge the safety of the user's query or instruction based on the visual context and adjust its answer accordingly. Given an unsafe visual context, the model should remind the user of the potential risk instead of directly answering the user's query. However, current MLLMs struggle to achieve this in most unsafe situations.
Figure 2: Presentation of MSSBench across four domains and ten secondary categories in Chat and Embodied tasks.
Figure 3: The overall structure of the chat data collection pipeline (left) and examples of two multimodal assistant scenarios (right). The pipeline includes four parts: (1) Generating Intended Activity and Unsafe Textual Situations. (2) Iterative Filtering with LLM. (3) Constructing a Multimodal Situational Safety Dataset via Image Retrieval. (4) Human Verification & Query Generation.
Figure 4: Individual performance comparison.
Figure 5: Average performance comparison.
...and 30 more figures
