OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences

Ming Wen, Kun Yang, Jingyu Zhang, Yuxuan Liu, Shiwen Cui, Shouling Ji, Xingjun Ma

TL;DR

We develop the Consequence-Aware Safety Policy Optimization (CASPO) framework, which integrates the model's intrinsic reasoning as a dynamic reference for token-level self-distillation rewards.

Abstract

While safety alignment for Multimodal Large Language Models (MLLMs) has gained significant attention, current paradigms primarily target malicious intent or situational violations. We propose shifting the safety frontier toward consequence-driven safety, a paradigm essential for the robust deployment of autonomous and embodied agents. To formalize this shift, we introduce OOD-MMSafe, a benchmark comprising 455 curated query-image pairs designed to evaluate a model's ability to identify latent hazards within context-dependent causal chains. Our analysis reveals a pervasive causal blindness among frontier models, with a failure rate as high as 67.5% among high-capacity closed-source models, and identifies a preference ceiling: as model capacity grows, static alignment yields format-centric failures rather than improved safety reasoning. To address these bottlenecks, we develop the Consequence-Aware Safety Policy Optimization (CASPO) framework, which integrates the model's intrinsic reasoning as a dynamic reference for token-level self-distillation rewards. Experimental results demonstrate that CASPO significantly enhances consequence projection, reducing the failure ratio of risk identification to 7.3% for Qwen2.5-VL-7B and 5.7% for Qwen3-VL-4B while maintaining overall effectiveness.

Paper Structure

This paper contains 33 sections, 7 equations, 17 figures, 10 tables, 1 algorithm.

Figures (17)

  • Figure 1: Comparison of MLLM safety paradigms and reasoning depth. (Left) Comparison across intent-, situational-, and consequence-driven dimensions through representative scenarios. (Right) Evolution of safety depth from intent detection to causal projection.
  • Figure 2: Data examples of OOD-MMSafe.
  • Figure 3: The OOD-MMSafe Curation and Evaluation Pipeline. We (I) synthesize latent hazards using a rigorous multi-stage quality filter, (II) ground contexts via hybrid image sourcing, and (III) refine causal reasoning by mitigating speculative interventions and lexical-visual overlap. Finally, (IV) tripartite metrics ($R$, $S$, $E$) evaluate model hazard awareness.
  • Figure 4: Performance gains of risk awareness measured by $\Delta R_0$, representing the failure reduction for Risk Appraisal ($R$) in identifying hazards. (a) Performance gains of static alignment in addressing next-state hazards (Standard) versus current-state intentions (Malicious). (b) Comparison of performance gains between static alignment and the Safety Constitution in Standard Mode.
  • Figure 5: POS distributions of top-5 tokens with the highest KL divergence induced by safety alignment and the safety constitution. Static alignment becomes increasingly format-centric as model capability grows, whereas the safety constitution maintains a dynamic focus on semantic entities.
  • ...and 12 more figures