Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models

Qin Liu, Chao Shang, Ling Liu, Nikolaos Pappas, Jie Ma, Neha Anna John, Srikanth Doss, Lluis Marquez, Miguel Ballesteros, Yassine Benajiba

TL;DR

This paper introduces Cross-Modality Representation Manipulation (CMRM), an inference-time representation intervention method that recovers the safety alignment inherent in the LLM backbone of VLMs while preserving the VLMs' functional capabilities.

Abstract

Integrating a vision module tends to degrade the safety alignment of Vision-Language Models (VLMs) relative to their LLM backbones. We investigate this phenomenon, which we call "safety alignment degradation", and show that it arises from the representation gap that emerges when the vision modality is introduced to VLMs. In particular, we show that the representations of multi-modal inputs shift away from those of text-only inputs, which represent the distribution that the LLM backbone is optimized for. At the same time, the safety alignment capabilities, initially developed within the textual embedding space, do not successfully transfer to this new multi-modal representation space. To reduce safety alignment degradation, we introduce Cross-Modality Representation Manipulation (CMRM), an inference-time representation intervention method that recovers the safety alignment inherent in the LLM backbone of VLMs while preserving the VLMs' functional capabilities. The empirical results show that our framework significantly recovers the alignment ability inherited from the LLM backbone, with minimal impact on the fluency and linguistic capabilities of pre-trained VLMs, even without additional training. Specifically, the unsafe rate of LLaVA-7B on multi-modal input can be reduced from 61.53% to as low as 3.15% with only inference-time intervention. WARNING: This paper contains examples of toxic or harmful language.

Paper Structure

This paper contains 29 sections, 6 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Visualization of three models' hidden states for five input variations using 2-dimensional PCA. The first and second rows are visualized with the VLSafe dataset and the manipulated JailbreakLLMs dataset, respectively. The representations of pure textual input (text + caption and text query only) and multi-modal input (original, text + blank image, and Gaussian noise) are clearly separable, especially for the VLSafe dataset (the first row).
  • Figure 2: Sensitivity analysis on alpha values for dataset-level CMRM. We show the safety performance of LLaVA 7B and ShareGPT on two datasets with varying coefficients. Generally, an alpha value of $1.0$ results in a lower unsafe rate.
  • Figure 3: Visualization of hidden states from the top layer of LLaVA-7B on the VLSafe dataset under CMRM. Dashed lines with numbers denote the distance between cluster centers. With the intervention of CMRM, the representations of vanilla input (yellow circles) are pulled closer to the cluster of hidden states for pure textual input (blue crosses), resulting in purple triangles. However, a high alpha value (e.g., $2.0$) pushes the hidden states too far, which in turn hurts VLMs' general ability.
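
The effect of the alpha coefficient described in the Figure 3 caption can be illustrated with a toy sketch. This is not the authors' implementation: it assumes, purely for illustration, that the intervention adds an alpha-scaled offset between the mean text-only hidden state and the mean multi-modal hidden state. The names `mean_vector`, `shift_toward_text`, and `distance` are hypothetical, and the 2-dimensional vectors stand in for real hidden states.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a non-empty collection of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def shift_toward_text(
    hidden: Sequence[float],
    text_states: Sequence[Sequence[float]],
    multimodal_states: Sequence[Sequence[float]],
    alpha: float = 1.0,
) -> Vector:
    """Assumed form of the intervention (hypothetical, for illustration only):
    move `hidden` along the offset from the multi-modal cluster center to the
    text-only cluster center, scaled by `alpha`."""
    text_center = mean_vector(text_states)
    mm_center = mean_vector(multimodal_states)
    direction = [t - m for t, m in zip(text_center, mm_center)]
    return [h + alpha * d for h, d in zip(hidden, direction)]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy 2-D "hidden states"; real hidden states are high-dimensional.
text_states = [[0.0, 0.0], [2.0, 0.0]]        # cluster center (1, 0)
multimodal_states = [[4.0, 3.0], [6.0, 3.0]]  # cluster center (5, 3)
hidden = [5.0, 3.0]                           # one multi-modal hidden state
text_center = mean_vector(text_states)

for alpha in (0.5, 1.0, 2.0):
    shifted = shift_toward_text(hidden, text_states, multimodal_states, alpha)
    print(alpha, shifted, distance(shifted, text_center))
```

In this toy setup, an alpha of 1.0 lands the state on the text-only cluster center, while an alpha of 2.0 overshoots and ends up as far from that center as the unshifted state, which mirrors the overshooting behavior the caption describes for high alpha values.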