ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time

Yi Ding, Bolian Li, Ruqi Zhang

TL;DR

ETA introduces a no-training, inference-time safety framework for Vision-Language Models by jointly evaluating multimodal inputs and outputs and then aligning unsafe generations through a two-step process: shallow interference-prefix prompting and deep sentence-level best-of-N selection guided by multimodal evaluators. The approach uses CLIP-based input safety scoring and a safety-focused reward model to detect unsafe content, applying alignment only when both evaluators flag risk. Empirical results show large reductions in unsafe responses across diverse backbones and tasks, with notable gains in helpfulness and only modest increases in latency, outperforming prior methods like ECSO and fine-tuning baselines. This work demonstrates that addressing the continuous nature of visual embeddings and combining evaluation with bi-level alignment can significantly enhance safety while preserving model utility, offering a practical, plug-and-play solution for real-world VLM deployment.
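The gating rule described above (align only when both evaluators flag risk) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `SafetyVerdict` and `evaluate`, the threshold values, and the score directions (a higher pre-generation score means a riskier image, a lower post-generation score means a less safe response) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SafetyVerdict:
    """Outcome of the two evaluators (hypothetical container)."""
    input_unsafe: bool   # visual input flagged by the CLIP-based score
    output_unsafe: bool  # initial response flagged by the reward model

    @property
    def needs_alignment(self) -> bool:
        # Alignment is triggered only when BOTH evaluators flag risk,
        # so benign queries keep their original response untouched.
        return self.input_unsafe and self.output_unsafe


def evaluate(pre_score: float, post_score: float,
             pre_threshold: float, post_threshold: float) -> SafetyVerdict:
    """Turn raw evaluator scores into flags.

    Assumed conventions (not taken from the source):
      - pre_score above pre_threshold means the image looks unsafe
      - post_score below post_threshold means the response looks unsafe
    """
    return SafetyVerdict(
        input_unsafe=pre_score > pre_threshold,
        output_unsafe=post_score < post_threshold,
    )


# Illustrative thresholds only.
verdict = evaluate(pre_score=0.3, post_score=-0.5,
                   pre_threshold=0.2, post_threshold=0.0)
print(verdict.needs_alignment)
```

Requiring both flags is what keeps the added latency modest in this sketch: the more expensive alignment stage runs only for the subset of requests that both checks consider risky.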

Abstract

Vision Language Models (VLMs) have become essential backbones for multimodal intelligence, yet significant safety challenges limit their real-world application. While textual inputs are often effectively safeguarded, adversarial visual inputs can easily bypass VLM defense mechanisms. Existing defense methods are either resource-intensive, requiring substantial data and compute, or fail to simultaneously ensure safety and usefulness in responses. To address these limitations, we propose a novel two-phase inference-time alignment framework, Evaluating Then Aligning (ETA): 1) Evaluating input visual contents and output responses to establish a robust safety awareness in multimodal settings, and 2) Aligning unsafe behaviors at both shallow and deep levels by conditioning the VLMs' generative distribution with an interference prefix and performing sentence-level best-of-N to search for the most harmless and helpful generation paths. Extensive experiments show that ETA outperforms baseline methods in terms of harmlessness, helpfulness, and efficiency, reducing the unsafe rate by 87.5% in cross-modality attacks and achieving 96.6% win-ties in GPT-4 helpfulness evaluation. The code is publicly available at https://github.com/DripNowhy/ETA.
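The second phase combines a shallow step (start the response from an interference prefix) with a deep step (pick the best of N candidates one sentence at a time). The sketch below illustrates that control flow under stated assumptions: `align_response`, `propose`, and `score` are hypothetical names, `propose` stands in for sampling candidate next sentences from a VLM, and `score` stands in for whatever combination of evaluators ranks a candidate. The official implementation is in the repository linked above.

```python
from typing import Callable, List


def align_response(
    propose: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    prefix: str,
    n_candidates: int = 5,
    max_sentences: int = 8,
) -> str:
    """Sentence-level best-of-N search seeded with a prefix.

    propose(partial, n) -> up to n candidate next sentences (empty when done)
    score(partial, sentence) -> higher means more harmless and helpful
    Both callables are placeholders for model calls.
    """
    # Shallow alignment: generation is conditioned on the prefix.
    response = prefix
    # Deep alignment: greedily keep the best-scoring sentence at each step.
    for _ in range(max_sentences):
        candidates = propose(response, n_candidates)
        if not candidates:
            break
        best = max(candidates, key=lambda s: score(response, s))
        response = f"{response} {best}".strip()
    return response


# Toy stand-ins so the sketch runs without a model.
_steps = iter([["bad idea.", "safe idea."], ["more bad.", "more safe."]])

def toy_propose(partial: str, n: int) -> List[str]:
    return next(_steps, [])[:n]

def toy_score(partial: str, sentence: str) -> float:
    return 1.0 if "safe" in sentence else 0.0

print(align_response(toy_propose, toy_score,
                     prefix="As an assistant,", n_candidates=2))
```

Scoring at sentence granularity, rather than once per full response, lets the search discard an unsafe continuation early instead of paying for N complete generations.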

Paper Structure

This paper contains 65 sections, 8 equations, 12 figures, 13 tables, 1 algorithm.

Figures (12)

  • Figure 1: ETA framework overview. ETA uses a multimodal evaluator to assess visual inputs with the CLIP score and initial generated responses with a textual reward model. For instances flagged as unsafe, ETA implements a comprehensive alignment process, which consists of both shallow alignment (interference prefix) and deep alignment (sentence-level best-of-$N$ searching).
  • Figure 2: Continuous visual token embeddings can bypass existing safety mechanisms that are primarily aligned with discrete textual token embeddings. To verify this hypothesis, we implemented a mapping that transforms continuous visual embeddings to their nearest discrete textual embeddings based on cosine similarity. This mapping results in a significant 7% reduction in the unsafe rate (USR) when evaluated on the SPA-VL Harm test set \citep{zhang2024spa} (we report more results on four VLM baselines and two datasets in Appendix \ref{appendix:more_motivation}). Fig. \ref{fig:img2txt_cosine} illustrates examples of these mapped textual tokens, demonstrating how offensive images are transformed into harmful tokens that can then be effectively addressed by the original safety mechanisms of LLM backbones.
  • Figure 3: Empirical effectiveness of ETA. (a) Unsafe rate (USR) on the SPA-VL Harm dataset. The red curve illustrates the safety degradation of LLM backbones due to visual modality fine-tuning and input; the green curve demonstrates the safety improvements achieved by ETA. (b) $\mathcal{S}_{\text{pre}}$ distribution (Eq. \ref{equ:pre}) on 100 safe and unsafe images sampled from COCO and MM-SafetyBench, respectively. $\mathcal{S}_{\text{pre}}$ demonstrates effective separation between safe and unsafe images.
  • Figure 4: Reward distribution comparison across different input formats. The distributions and KL divergence values in the figure show that our proposed safety-specific input format better distinguishes between safe and unsafe responses.
  • Figure 5: Helpfulness evaluation on the SPA-VL Help test set shows that ETA outperforms other methods in the GPT-4-Turbo evaluated win-tie-lose rate, demonstrating its superior ability to align responses with human preferences.
  • ...and 7 more figures
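The mapping described in the Figure 2 caption (each continuous visual embedding is replaced by its nearest textual token embedding under cosine similarity) can be illustrated with a small pure-Python sketch. The function names and the two-dimensional toy vocabulary are assumptions for the example; a real VLM would use the language model's embedding matrix and projected visual tokens.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_text_tokens(
    visual_embeddings: Sequence[Sequence[float]],
    vocab: Dict[str, Sequence[float]],
) -> List[str]:
    """Map each visual embedding to the vocabulary token whose embedding
    has the highest cosine similarity with it."""
    return [
        max(vocab, key=lambda tok: cosine(vis, vocab[tok]))
        for vis in visual_embeddings
    ]


# Toy data: a hypothetical 2-D embedding space with two tokens.
toy_vocab = {"cat": [1.0, 0.0], "knife": [0.0, 1.0]}
toy_visual = [[0.9, 0.1], [0.2, 0.8]]
print(nearest_text_tokens(toy_visual, toy_vocab))
```

Once visual inputs are expressed as discrete tokens in this way, they pass through the same embedding space the language backbone's text-side safety behavior was aligned on, which is the effect the caption reports.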