CeTAD: Towards Certified Toxicity-Aware Distance in Vision Language Models
Xiangyu Yin, Jiaxu Liu, Zhen Chen, Jinwei Hu, Yi Dong, Xiaowei Huang, Wenjie Ruan
TL;DR
CeTAD addresses the safety of vision-language models (VLMs) against visual jailbreaks with a toxicity-aware distance and a regressed certification framework based on randomized smoothing. The distance is defined as $\mathcal{D} = 1 - (\lambda \Gamma(r) + (1-\lambda) C(r, r_t))$. Gaussian and Laplacian smoothing are then used to certify the probability that this distance exceeds a threshold, i.e., $\mathbb{P}(\mathbb{E}_{r}[\mathcal{D}] \ge \epsilon) \ge P$, yielding adaptive robust radii under the $\ell_2$- and $\ell_1$-norms, respectively. The paper also presents sample-size reduction strategies based on Clopper-Pearson intervals and Bayesian bounds, which maintain certification quality with fewer samples, and validates the approach empirically on MiniGPT-4 with ImageNet prompts and harmful instructions. The results demonstrate formal guarantees against both pixel- and structure-level perturbations, advancing trustworthy deployment of VLMs in potentially safety-critical scenarios.
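To make the distance formula and the Clopper-Pearson step concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. It assumes that $\Gamma(r)$ and $C(r, r_t)$ are already available as scalars in $[0, 1]$, it uses scalar Gaussian noise as a stand-in for image- or feature-space noise, and all function names (`toxicity_aware_distance`, `clopper_pearson_lower`, `certify_exceed_probability`) are hypothetical. The bound is found by bisection on the binomial tail so that the example needs only the standard library.

```python
import math
import random


def toxicity_aware_distance(gamma, c, lam=0.5):
    """D = 1 - (lam * gamma + (1 - lam) * c).

    gamma and c are assumed to be precomputed scalars in [0, 1].
    """
    return 1.0 - (lam * gamma + (1.0 - lam) * c)


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); fine for small n such as a few hundred."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def clopper_pearson_lower(k, n, alpha=0.05):
    """One-sided lower (1 - alpha) confidence bound on a binomial proportion.

    Solves P(X >= k | p) = alpha for p by bisection; the tail is increasing in p.
    """
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_tail(k, n, mid) < alpha:
            lo = mid  # mid is still too small to explain k successes
        else:
            hi = mid
    return lo


def certify_exceed_probability(distance_under_noise, eps, n=200, alpha=0.05,
                               sigma=0.25, seed=0):
    """Lower-bound the probability that the noisy distance is at least eps.

    distance_under_noise is a hypothetical callable mapping one scalar noise
    draw to a distance value; a real pipeline would perturb the image or its
    embedding and query the model instead.
    """
    rng = random.Random(seed)
    hits = sum(distance_under_noise(rng.gauss(0.0, sigma)) >= eps for _ in range(n))
    return clopper_pearson_lower(hits, n, alpha)
```

With this sketch, a response whose noisy distance always clears $\epsilon$ over 200 draws receives a lower bound slightly below 1, which reflects the finite sample size rather than any observed failure.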
Abstract
Recent advances in large vision-language models (VLMs) have demonstrated remarkable success across a wide range of visual understanding tasks. However, the robustness of these models against jailbreak attacks remains an open challenge. In this work, we propose a universal certified defence framework to safeguard VLMs rigorously against potential visual jailbreak attacks. First, we propose a novel distance metric to quantify semantic discrepancies between malicious and intended responses, capturing subtle differences often overlooked by conventional cosine similarity-based measures. Then, we devise a regressed certification approach that employs randomized smoothing to provide formal robustness guarantees against both adversarial and structural perturbations, even under black-box settings. Complementing this, our feature-space defence introduces noise distributions (e.g., Gaussian, Laplacian) into the latent embeddings to safeguard against both pixel-level and structure-level perturbations. Our results highlight the potential of a formally grounded, integrated strategy toward building more resilient and trustworthy VLMs.
