Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang

TL;DR

This work studies safety vulnerabilities in multimodal LLMs (MLLMs), showing that image inputs can suppress the safety mechanisms of the pre-aligned LLM inside an MLLM. It proposes ECSO, a training-free defense that first has the model assess the safety of its own response; if the response is unsafe, ECSO converts the input image into a query-aware text caption and regenerates the response without the image, which reactivates the LLM's intrinsic safety mechanism. ECSO significantly improves safety across five state-of-the-art MLLMs on MM-SafetyBench and VLSafe (e.g., a 37.6% improvement on MM-SafetyBench (SD+OCR) and 71.3% on VLSafe with LLaVA-1.5-7B) while preserving utility on common MLLM benchmarks. ECSO can also serve as a data engine that generates supervised-finetuning (SFT) data for safety alignment without extra human intervention, supporting scalable alignment. Because it relies only on the model's own safety awareness and a targeted image-to-text (I2T) transformation, it is a practical defense that can complement or substitute for traditional red-teaming and post-hoc filtering in multimodal settings.

Abstract

Multimodal large language models (MLLMs) have shown impressive reasoning abilities. However, they are also more vulnerable to jailbreak attacks than their LLM predecessors. Although MLLMs are still capable of detecting unsafe responses, we observe that the safety mechanisms of the pre-aligned LLMs in MLLMs can be easily bypassed with the introduction of image features. To construct robust MLLMs, we propose ECSO (Eyes Closed, Safety On), a novel training-free protecting approach that exploits the inherent safety awareness of MLLMs and generates safer responses by adaptively transforming unsafe images into text to activate the intrinsic safety mechanism of the pre-aligned LLMs in MLLMs. Experiments on five state-of-the-art (SoTA) MLLMs demonstrate that ECSO enhances model safety significantly (e.g., 37.6% improvement on MM-SafetyBench (SD+OCR) and 71.3% on VLSafe with LLaVA-1.5-7B), while consistently maintaining utility results on common MLLM benchmarks. Furthermore, we show that ECSO can be used as a data engine to generate supervised-finetuning (SFT) data for MLLM alignment without extra human intervention.
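The data-engine idea in the last sentence can be illustrated with a minimal sketch. It assumes a protected-generation function is already available and treats it as a black box. The names `build_sft_data` and `protect`, the record fields, and the choice to keep only flagged samples are hypothetical illustrations, not the authors' implementation.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical interface: given (query, image), return the final response
# and a flag saying whether the initial response was judged unsafe.
Protect = Callable[[str, bytes], Tuple[str, bool]]


def build_sft_data(
    samples: Iterable[Tuple[str, bytes, str]],
    protect: Protect,
) -> List[Dict[str, str]]:
    """Collect (image, query, safe response) records without human labeling.

    Each sample is (image_id, image_bytes, query). Only samples whose initial
    response was flagged unsafe are kept (an assumption of this sketch), since
    those are the cases where the protected response differs from the model's
    default behavior.
    """
    records = []
    for image_id, image, query in samples:
        response, flagged = protect(query, image)
        if flagged:
            records.append(
                {"image": image_id, "query": query, "response": response}
            )
    return records
```

The resulting records pair the original image and query with the safer response, which is the shape a supervised-finetuning set for alignment would typically need.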

Paper Structure (53 sections, 4 equations, 19 figures, 17 tables)

This paper contains 53 sections, 4 equations, 19 figures, 17 tables.

Figures (19)

  • Figure 1: (left) MLLMs are vulnerable to malicious questions when queried with images but can restore safety when images are excluded. (right) Comparisons of harmless rate (%) of model responses with and without images on five state-of-the-art MLLMs.
  • Figure 2: (left) Though vulnerable to malicious questions, MLLMs are aware of their own unsafe responses. (right) Accuracy of MLLMs' discrimination (with and without images) on whether their own responses are safe.
  • Figure 3: Overview of ECSO. Step 1: User queries are fulfilled as usual. Step 2: The MLLM is prompted to judge whether its initial response is safe. Safe answers are returned, while unsafe ones proceed to Steps 3 and 4. Step 3: Images of unsafe queries are converted into text via query-aware image-to-text (I2T) transformation. Step 4: Malicious content in both images and user queries is now represented as plain text, which can be dealt with by the pre-aligned LLMs in MLLMs to generate safe responses.
  • Figure 4: Prompt templates for ECSO, where <image> denotes the presence of image inputs and {} denotes a placeholder for the actual text inputs.
  • Figure 5: Qualitative comparison showing how ECSO generates harmless responses. Direct/Initial responses: Model response when directly prompted. This is also the initial response in the first step of ECSO. Harm?: Harmful content detection as in Sec. \ref{sec_method_tell}. Caption: Query-aware I2T captioning as in Sec. \ref{sec_cap}. ECSO: Safe response generation without images by ECSO as in Sec. \ref{sec_safegen}. Text in red (resp. green) is harmful (resp. harmless). Dashed red rectangles highlight content that activates the safety mechanism within the pre-aligned LLMs in Sec. \ref{sec_safegen}.
  • ...and 14 more figures
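
The four steps described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code. The `generate` callable stands in for any MLLM. The prompt wordings are hypothetical placeholders (the paper's actual templates are in Figure 4), as are the "yes"-prefix verdict parsing and the choice to show the image to the judge.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical model interface: (prompt, image or None) -> generated text.
Generate = Callable[[str, Optional[bytes]], str]

# Placeholder prompts (assumptions); the real templates are given in Figure 4.
JUDGE_PROMPT = (
    "Query: {query}\nResponse: {response}\n"
    "Is the response harmful or unsafe? Answer yes or no."
)
CAPTION_PROMPT = (
    "Request: {query}\n"
    "Describe the image content that is relevant to this request."
)
ANSWER_PROMPT = "Image description: {caption}\nRequest: {query}"


@dataclass
class ECSOResult:
    response: str
    flagged_unsafe: bool
    caption: Optional[str] = None


def ecso(generate: Generate, query: str, image: bytes) -> ECSOResult:
    # Step 1: answer the query as usual, with the image.
    initial = generate(query, image)

    # Step 2: ask the same model to judge its own response.
    # Passing the image here is an assumption of this sketch.
    verdict = generate(
        JUDGE_PROMPT.format(query=query, response=initial), image
    )
    if not verdict.strip().lower().startswith("yes"):
        return ECSOResult(response=initial, flagged_unsafe=False)

    # Step 3: query-aware image-to-text (I2T) transformation.
    caption = generate(CAPTION_PROMPT.format(query=query), image)

    # Step 4: regenerate from text only ("eyes closed"), so the
    # pre-aligned LLM sees the malicious content as plain text.
    safe = generate(ANSWER_PROMPT.format(caption=caption, query=query), None)
    return ECSOResult(response=safe, flagged_unsafe=True, caption=caption)
```

Because the fallback path runs only when Step 2 flags the initial response, benign queries cost one extra judging call and keep their original image-grounded answer, which is consistent with the claim that utility is preserved.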