ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time
Yi Ding, Bolian Li, Ruqi Zhang
TL;DR
ETA is a training-free, inference-time safety framework for Vision-Language Models. It first evaluates both the multimodal input and the generated output, then aligns unsafe generations in two steps: shallow alignment through an interference prefix, and deep alignment through sentence-level best-of-N selection guided by the multimodal evaluators. Unsafe content is detected with CLIP-based input safety scoring and a safety-focused reward model, and alignment is applied only when both evaluators flag risk. Across diverse backbones and tasks, ETA reduces the unsafe rate by 87.5% in cross-modality attacks and achieves 96.6% win-ties in GPT-4 helpfulness evaluation, with only modest increases in latency, outperforming prior methods such as ECSO and fine-tuning baselines. The work shows that accounting for the continuous nature of visual embeddings and combining evaluation with bi-level alignment can significantly improve safety while preserving model utility, offering a practical, plug-and-play solution for real-world VLM deployment.
Abstract
Vision Language Models (VLMs) have become essential backbones for multimodal intelligence, yet significant safety challenges limit their real-world application. While textual inputs are often effectively safeguarded, adversarial visual inputs can easily bypass VLM defense mechanisms. Existing defense methods are either resource-intensive, requiring substantial data and compute, or fail to simultaneously ensure safety and usefulness in responses. To address these limitations, we propose a novel two-phase inference-time alignment framework, Evaluating Then Aligning (ETA): 1) Evaluating input visual contents and output responses to establish a robust safety awareness in multimodal settings, and 2) Aligning unsafe behaviors at both shallow and deep levels by conditioning the VLMs' generative distribution with an interference prefix and performing sentence-level best-of-N to search for the most harmless and helpful generation paths. Extensive experiments show that ETA outperforms baseline methods in terms of harmlessness, helpfulness, and efficiency, reducing the unsafe rate by 87.5% in cross-modality attacks and achieving 96.6% win-ties in GPT-4 helpfulness evaluation. The code is publicly available at https://github.com/DripNowhy/ETA.
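The two-phase control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `eta_respond`, all callback signatures, the default prefix text, and the values of `n` and `max_sentences` are assumptions made for illustration, and the model, the evaluators, and the scoring function are passed in as plain callables so the sketch stays self-contained.

```python
from typing import Callable, List


def eta_respond(
    image: object,
    prompt: str,
    generate: Callable[[object, str], str],
    propose: Callable[[object, str, str, int], List[str]],
    input_unsafe: Callable[[object], bool],
    output_unsafe: Callable[[str, str], bool],
    score: Callable[[str, str], float],
    prefix: str = "As an AI assistant, ",  # placeholder prefix (assumption)
    n: int = 5,                             # candidates per sentence (assumption)
    max_sentences: int = 8,                 # length cap (assumption)
) -> str:
    """Hypothetical evaluate-then-align loop; every name here is illustrative.

    generate(image, prompt)              -> unconstrained draft response
    propose(image, prompt, partial, n)   -> up to n candidate next sentences
    input_unsafe(image)                  -> True if the visual input looks unsafe
    output_unsafe(prompt, response)      -> True if the draft response looks unsafe
    score(prompt, partial_response)      -> higher means more harmless and helpful
    """
    # Phase 1: evaluate the visual input and the draft output.
    draft = generate(image, prompt)
    if not (input_unsafe(image) and output_unsafe(prompt, draft)):
        # Alignment is triggered only when both evaluators flag risk.
        return draft

    # Phase 2a (shallow alignment): start the response from a fixed prefix
    # so that later sentences are conditioned on it.
    response = prefix

    # Phase 2b (deep alignment): sentence-level best-of-N. At each step,
    # keep the candidate sentence whose continuation scores highest.
    for _ in range(max_sentences):
        candidates = propose(image, prompt, response, n)
        if not candidates:
            break  # the proposer signals that the response is complete
        best = max(candidates, key=lambda s: score(prompt, response + s))
        response += best
    return response
```

In a real system, `input_unsafe` might threshold a CLIP-based safety score and `output_unsafe` and `score` might wrap a reward model, as the summary above suggests; those wirings are left out here because their exact form is not specified in this section. Because the draft is returned unchanged whenever either evaluator considers the case safe, benign queries pay only the cost of the evaluation step.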
