Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Model

Siyin Wang; Xingsong Ye; Qinyuan Cheng; Junwen Duan; Shimin Li; Jinlan Fu; Xipeng Qiu; Xuanjing Huang

TL;DR

Safe Inputs but Unsafe Output (SIUO) defines a cross-modality safety challenge for LVLMs and introduces a dedicated benchmark spanning nine harmfulness domains. The authors construct SIUO via a hybrid human and AI-assisted data pipeline, yielding 167 human-crafted and 102 AI-assisted test cases with safe image-text pairs that can produce unsafe outputs when fused semantically. The benchmark is validated with automated safety filters and human review, and evaluated across 15 LVLMs in zero-shot settings using text generation and MCQA tasks, with safety and effectiveness measured by human judgments and GPT-4V as an automated evaluator. Findings show substantial safety vulnerabilities even in strong models like GPT-4V, highlighting critical gaps in cross-modal integration, knowledge, and reasoning and underscoring the need for robust cross-modality safety alignment and improved evaluation methodologies.

Abstract

As Artificial General Intelligence (AGI) becomes increasingly integrated into various facets of human life, ensuring the safety and ethical alignment of such systems is paramount. Previous studies primarily focus on single-modality threats, which may not suffice given the integrated and complex nature of cross-modality interactions. We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment. Specifically, it considers cases where each modality is safe on its own but the combination could lead to unsafe or unethical outputs. To empirically investigate this problem, we developed SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations. Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, such as GPT-4V and LLaVA, underscoring the inability of current models to reliably interpret and respond to complex, real-world scenarios.

Paper Structure (51 sections, 13 figures, 4 tables)

Introduction
Related Work
The SIUO Benchmark
Why Does Vision-language Context Lead to New Safety Challenges?
Constructing SIUO
Criteria for Selecting Images and Text
Human Curation
AI-Assisted Curation
Quality Control
Dataset Structure and Content
Validating SIUO
Experiments
Models
Tasks and Evaluation
Task
...and 36 more sections

Figures (13)

Figure 1: An example of the SIUO (Safe Inputs but Unsafe Output). The input consists of a safe image and text, but their semantic combination is unsafe. Such inputs can also prompt LVLMs to generate unsafe output.
Figure 2: The safe rate of various LVLMs across multiple safety domains in the SIUO benchmark, highlighting significant ongoing safety vulnerabilities in current models, where safe rate means the ratio of the number of safe responses to the total number of responses.
Figure 3: Examples of safety risks that may arise due to the lack of integration, knowledge, and reasoning capabilities in LVLMs, even with safe image and text input.
Figure 4: The framework of AI-Assisted Curation. The model hypothesizes unsafe events based on the image and generates a test sample (Step 1), then refines it by reflecting on information redundancy and completeness (Step 2), ensures safety via a text-only judge (Step 3), and finally, human reviewers select the sample for safety, difficulty, informativeness, and edit as necessary (Step 4).
Figure 5: SIUO covers 9 safety domains and 33 subcategories. Examples can be found in the appendix.
...and 8 more figures
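
Figure 2 defines the safe rate as the ratio of safe responses to the total number of responses. A minimal sketch of that calculation follows, assuming per-response safety judgments are already available as booleans. The function names, the percentage scaling, and the sample records are illustrative and do not come from the paper.

```python
from collections import defaultdict


def safe_rate(labels):
    """Return the share of responses judged safe, as a percentage."""
    if not labels:
        raise ValueError("no responses to score")
    return 100.0 * sum(1 for is_safe in labels if is_safe) / len(labels)


def safe_rate_by_domain(records):
    """Group (domain, is_safe) pairs and compute a safe rate per domain."""
    grouped = defaultdict(list)
    for domain, is_safe in records:
        grouped[domain].append(is_safe)
    return {domain: safe_rate(flags) for domain, flags in grouped.items()}


# Hypothetical judgments for illustration only, not data from the paper.
records = [
    ("self-harm", True),
    ("self-harm", False),
    ("privacy violations", True),
    ("privacy violations", True),
]
rates = safe_rate_by_domain(records)
```

Grouping by domain mirrors the per-domain breakdown that Figure 2 describes; the judgments themselves would come from human reviewers or an automated evaluator such as GPT-4V, as noted in the TL;DR.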
