Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models
Yi Ding, Lijun Li, Bing Cao, Jing Shao
TL;DR
The paper identifies key bottlenecks in safety fine-tuning of Vision-Language Models, notably a lack of safety-related visual reasoning, and introduces MIS, a multi-image safety dataset, to address this gap. It proposes MIRage, a reasoning-informed fine-tuning framework that uses safety CoT prompts to improve multi-image safety while maintaining general capabilities. Empirical results show that MIS is challenging for both open-source and API-based models, and that MIRage substantially reduces attack success rates and hallucinations across diverse models and benchmarks, with strong generalization to unseen safety categories. The work also highlights safety breakdowns on synthetic imagery and emphasizes the value of explicit visual reasoning for robust safety in complex multimodal contexts.
Abstract
Large Vision-Language Models (VLMs) have achieved remarkable performance across a wide range of tasks. However, their deployment in safety-critical domains poses significant challenges. Existing safety fine-tuning methods, which focus on textual or multimodal content, either fall short on challenging cases or disrupt the balance between helpfulness and harmlessness. Our evaluation highlights a safety reasoning gap: these methods lack safety-related visual reasoning ability, which leads to these bottlenecks. To address this limitation and enhance both visual perception and reasoning in safety-critical contexts, we propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels, which serve as fine-grained reasoning logic to improve model performance. Specifically, we introduce the Multi-Image Safety (MIS) dataset, an instruction-following dataset tailored for multi-image safety scenarios, consisting of training and test splits. Our experiments demonstrate that InternVL2.5-8B fine-tuned with MIS significantly outperforms both powerful open-source models and API-based models on challenging multi-image tasks requiring safety-related visual reasoning. This approach delivers strong safety performance while preserving general capabilities: fine-tuning with MIS increases average accuracy by 0.83% across five general benchmarks and reduces the Attack Success Rate (ASR) on multiple safety benchmarks by a large margin.
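The two headline quantities in the abstract, Attack Success Rate and the average accuracy change across general benchmarks, can be illustrated with a minimal sketch. It assumes the common definition of ASR as the percentage of harmful prompts whose response is judged unsafe; the paper's exact judging protocol is not given here, and the function names, benchmark names, and numbers below are hypothetical placeholders, not results from the paper.

```python
from typing import Mapping, Sequence


def attack_success_rate(judged_unsafe: Sequence[bool]) -> float:
    """Percentage of harmful prompts whose response was judged unsafe.

    Assumption: one boolean verdict per prompt, produced by some external
    judge (human or model). How verdicts are obtained is out of scope here.
    """
    if not judged_unsafe:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(judged_unsafe) / len(judged_unsafe)


def mean_accuracy_change(
    before: Mapping[str, float], after: Mapping[str, float]
) -> float:
    """Average per-benchmark accuracy change (after minus before)."""
    if not before or before.keys() != after.keys():
        raise ValueError("both mappings must cover the same benchmarks")
    return sum(after[name] - before[name] for name in before) / len(before)


# Toy inputs for illustration only (not data from the paper).
toy_verdicts = [True, False, False, False]
toy_before = {"bench_a": 50.0, "bench_b": 60.0}
toy_after = {"bench_a": 51.0, "bench_b": 61.0}

asr = attack_success_rate(toy_verdicts)
delta = mean_accuracy_change(toy_before, toy_after)
print(f"ASR: {asr:.1f}%  mean accuracy change: {delta:+.2f}")
```

A lower ASR after fine-tuning indicates fewer successful attacks, while a non-negative mean accuracy change indicates that general capabilities were preserved on the benchmarks considered.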
