UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models

Segyu Lee, Boryeong Cho, Hojung Jung, Seokhyun An, Juhyeong Kim, Jaehyun Kwak, Yongjin Yang, Sangwon Jang, Youngrok Park, Wonjun Chang, Se-Young Yun

Abstract

Unified Multimodal Models (UMMs) offer powerful cross-modality capabilities but introduce new safety risks not observed in single-task models. Despite the emergence of UMMs, existing safety benchmarks remain fragmented across tasks and modalities, limiting the comprehensive evaluation of complex system-level vulnerabilities. To address this gap, we introduce UniSAFE, the first comprehensive benchmark for system-level safety evaluation of UMMs across 7 I/O modality combinations, spanning conventional tasks and novel multimodal-context image generation settings. UniSAFE is built with a shared-target design that projects common risk scenarios across task-specific I/O configurations, enabling controlled cross-task comparisons of safety failures. UniSAFE comprises 6,802 curated instances, which we use to evaluate 15 state-of-the-art UMMs, both proprietary and open-source. Our results reveal critical vulnerabilities across current UMMs, including elevated safety violations in multi-image composition and multi-turn settings, with image-output tasks consistently more vulnerable than text-output tasks. These findings highlight the need for stronger system-level safety alignment for UMMs. Our code and data are publicly available at https://github.com/segyulee/UniSAFE

Paper Structure (107 sections, 13 equations, 29 figures, 10 tables)


Figures (29)

  • Figure 1: Examples of outputs generated by UniSAFE. Our benchmark consists of risk scenarios centered on a common target across 7 distinct task types, enabling evaluation across diverse risk settings.
  • Figure 2: Overview of the UniSAFE three-step data construction pipeline: (1) collect unsafe triggers across threat categories, (2) expand them into contextual target descriptions, and (3) instantiate shared, multimodal task-specific risk scenarios for safety evaluation of UMMs.
  • Figure 3: Taxonomy of safety categories for image and text modalities.
  • Figure 4: Refusal Rates for commercial UMMs across different tasks. Refusal Rates are further decomposed into system-level Refusal Rates and model-level Refusal Rates.
  • Figure 5: Safety risk across tasks and modalities in commercial UMMs. For GPT-5, Gemini-2.5, and Qwen-image, the bars show the proportion of test samples that produce harmful content (moderate- and high-risk) for each of the 7 task types. Image-output tasks (text-to-image, image editing, image composition, multi-turn) consistently exhibit higher harmful content rates than text-output tasks (text-to-text, image-to-text, multimodal understanding), revealing a strong modality-dependent bias in safety alignment.
  • ...and 24 more figures

Theorems & Definitions (3)

  • Definition 3.1: Characterizing unified tasks
  • Definition 3.2: Self-Awareness Score (SAS)
  • Definition B.1: Generalized Multi-Turn and Multi-Modal Task