Cross-Modal Safety Alignment: Is textual unlearning all you need?

Trishna Chakraborty; Erfan Shayegani; Zikui Cai; Nael Abu-Ghazaleh; M. Salman Asif; Yue Dong; Amit K. Roy-Chowdhury; Chengyu Song

TL;DR

This work investigates cross-modality safety alignment for vision-language models (VLMs) by evaluating textual unlearning as a scalable defense. By updating only the LLM inside a VLM, textual unlearning reduces the Attack Success Rate (ASR) to below 8%, and in some cases to nearly 2%, for both text-only and vision-text attacks while preserving utility. Unlearning with a multi-modal dataset, by contrast, can cost up to about six times more compute. The study compares textual unlearning, multi-modal unlearning, and multi-modal SFT across six datasets and finds that textual unlearning matches or outperforms the alternatives while being far more efficient; harm coverage of the textual dataset appears pivotal for cross-modality generalization. These findings suggest a practical, scalable path toward safer multi-modal systems and leave larger models and additional modalities to future work.

Abstract

Recent studies reveal that integrating new modalities into Large Language Models (LLMs), such as Vision-Language Models (VLMs), creates a new attack surface that bypasses existing safety training techniques like Supervised Fine-tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF). While further SFT and RLHF-based safety training can be conducted in multi-modal settings, collecting multi-modal training datasets poses a significant challenge. Inspired by the structural design of recent multi-modal models, where all inputs are ultimately fused into the language space regardless of the combination of input modalities, we explore whether unlearning solely in the textual domain can be effective for cross-modality safety alignment. Our evaluation across six datasets empirically demonstrates this transferability: textual unlearning in VLMs reduces the Attack Success Rate (ASR) to less than 8%, and in some cases to nearly 2%, for both text-based and vision-text-based attacks, while preserving utility. Moreover, our experiments show that unlearning with a multi-modal dataset offers no additional benefit yet incurs significantly higher computational demands, possibly up to 6 times higher.
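The ASR figures quoted above are percentages of attack prompts that still elicit a harmful response. The snippet below is a minimal sketch of that arithmetic only, not the paper's evaluation pipeline: the function name `attack_success_rate` and the boolean judgments are hypothetical, and it assumes some external judge has already labeled each response as harmful or not.

```python
from typing import Sequence


def attack_success_rate(is_harmful: Sequence[bool]) -> float:
    """Return the percentage of attack prompts whose response was judged harmful.

    `is_harmful` holds one boolean per attack prompt; how each judgment is
    produced (human review, classifier, etc.) is outside this sketch.
    """
    if not is_harmful:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(is_harmful) / len(is_harmful)


# Hypothetical judgments for 50 attack prompts, 3 of which succeeded.
judgments = [True] * 3 + [False] * 47
asr = attack_success_rate(judgments)
print(f"ASR = {asr:.1f}%")
```

With these made-up numbers the attack succeeds on 3 of 50 prompts, so an ASR "below 8%" would correspond to fewer than 4 successes out of 50.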

Paper Structure

This paper contains 29 sections, 10 equations, 1 figure, and 9 tables.

Introduction
Background
Multimodal Large Language Models (MLLMs).
Safety Alignment.
Challenges in Cross-Modality Safety.
Machine Unlearning.
Methodology
Notations.
Unlearning.
Textual Unlearning.
Multi-Modal Unlearning.
Multi-Modal SFT.
Experiments
Experimental setup
Datasets.
...and 14 more sections

Figures (1)

Figure 1: (A) Overview of our settings: multi-modal SFT (Supervised Fine-Tuning), multi-modal unlearning, and textual unlearning. In all experiments, only the LLM is updated and the rest of the VLM components are frozen; textual unlearning outperforms the other two in both effectiveness and computational efficiency. (B) With added modalities, the input embedding space expands significantly, making it unlikely that SFT-based approaches generalize effectively, so some inputs are likely to bypass SFT defenses. Our approach, textual unlearning, modifies the language modeling objective of the LLM so that it avoids generating undesired content when given harmful context, regardless of the input modalities.
