CeTAD: Towards Certified Toxicity-Aware Distance in Vision Language Models

Xiangyu Yin, Jiaxu Liu, Zhen Chen, Jinwei Hu, Yi Dong, Xiaowei Huang, Wenjie Ruan

TL;DR

CeTAD addresses the safety of vision-language models against visual jailbreaks by introducing a toxicity-aware distance and a regressed certification framework based on randomized smoothing. It combines a toxicity-aware distance $\mathcal{D}$, defined as $\mathcal{D} = 1 - (\lambda \Gamma(r) + (1-\lambda) C(r, r_t))$, with Gaussian and Laplacian smoothing to certify the probability that the distance exceeds a threshold, i.e., $\mathbb{P}(\mathbb{E}_{r}[\mathcal{D}] \ge \epsilon) \ge P$, yielding an adaptive robust radius under the $\ell_2$-norm and $\ell_1$-norm. It also presents sample-size reduction strategies via Clopper-Pearson intervals and Bayesian bounds to maintain certification quality with fewer samples, and provides empirical validation on MiniGPT-4 with ImageNet prompts and harmful instructions. The results demonstrate formal guarantees against both pixel- and structure-level perturbations, advancing trustworthy deployment of VLMs in potentially safety-critical scenarios.
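The pipeline summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that $\Gamma(r)$ is a toxicity score in $[0, 1]$ supplied by some external scorer, that $C(r, r_t)$ is a cosine similarity between response embeddings, and that one response is drawn per noisy sample (the expectation over responses is not modeled). The names `toxicity_aware_distance`, `clopper_pearson_lower`, `certify`, and `toy_distance` are hypothetical, and the toy distance function stands in for a real VLM query.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toxicity_aware_distance(toxicity, resp_emb, target_emb, lam=0.5):
    """D = 1 - (lam * Gamma(r) + (1 - lam) * C(r, r_t)).

    `toxicity` plays the role of Gamma(r); how it is scored is an assumption.
    """
    similarity = cosine_similarity(resp_emb, target_emb)
    return 1.0 - (lam * toxicity + (1.0 - lam) * similarity)


def clopper_pearson_lower(k, n, alpha=0.05):
    """One-sided (1 - alpha) lower confidence bound on a binomial proportion.

    Solves P(Bin(n, p) >= k) = alpha for p by bisection, using only the
    standard library. Intended for moderate n (a few hundred samples).
    """
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        tail = sum(
            math.comb(n, i) * mid**i * (1.0 - mid) ** (n - i)
            for i in range(k, n + 1)
        )
        if tail < alpha:
            lo = mid  # p too small: upper tail is still below alpha
        else:
            hi = mid
    return lo


def certify(distance_fn, features, sigma, eps, n=200, alpha=0.05, seed=0):
    """Lower-bound P(D >= eps) under isotropic Gaussian feature noise."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        noisy = [x + rng.gauss(0.0, sigma) for x in features]
        if distance_fn(noisy) >= eps:
            hits += 1
    return clopper_pearson_lower(hits, n, alpha)


# Toy stand-in for a VLM: the "response embedding" is the noisy feature
# vector itself and the toxicity score is fixed. Both are assumptions.
TARGET_EMB = [1.0, 0.0, 0.0]


def toy_distance(noisy_features):
    return toxicity_aware_distance(0.1, noisy_features, TARGET_EMB, lam=0.5)


p_lower = certify(toy_distance, [0.0, 1.0, 0.0], sigma=0.1, eps=0.2)
print(f"certified lower bound on P(D >= 0.2): {p_lower:.4f}")
```

Swapping `rng.gauss` for a Laplace sampler would give the $\ell_1$ counterpart of this sketch. The returned bound is the quantity from which a robust radius would then be derived, a step not shown here.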

Abstract

Recent advances in large vision-language models (VLMs) have demonstrated remarkable success across a wide range of visual understanding tasks. However, the robustness of these models against jailbreak attacks remains an open challenge. In this work, we propose a universal certified defence framework to safeguard VLMs rigorously against potential visual jailbreak attacks. First, we propose a novel distance metric to quantify semantic discrepancies between malicious and intended responses, capturing subtle differences often overlooked by conventional cosine similarity-based measures. Then, we devise a regressed certification approach that employs randomized smoothing to provide formal robustness guarantees against both adversarial and structural perturbations, even under black-box settings. Complementing this, our feature-space defence introduces noise distributions (e.g., Gaussian, Laplacian) into the latent embeddings to safeguard against both pixel-level and structure-level perturbations. Our results highlight the potential of a formally grounded, integrated strategy toward building more resilient and trustworthy VLMs.

Paper Structure

This paper contains 26 sections, 4 theorems, 39 equations, 4 figures, 6 tables.

Key Result

Lemma 3.3

Let $\mathbf{x}\sim\mathcal{N}(\mathbf{m}_1, \sigma^2\mathbf{I})$, $\mathbf{y}\sim\mathcal{N}(\mathbf{m}_2, \sigma^2\mathbf{I})$, and $f:\mathbb{R}^d\rightarrow\{0, 1\}$ be any deterministic or random mapping function. We have:

Figures (4)

  • Figure 1: Using the universal visual prompt described in the appendix, we randomly select 6 textual prompts from a set of harmful instructions manually crafted in [qi2024visual]. These bi-modal prompts are used to query MiniGPT-4 (Vicuna-13B). For each bi-modal prompt, five clean responses ($R_1$-$R_5$) are generated. Then, using the jailbreak method outlined in [qi2024visual], we generate 15 adversarial responses ($R_6$-$R_{10}$ and $TR_{1}$-$TR_{10}$). Visual features are extracted using BLIP-2 [li2023blip2bootstrappinglanguageimagepretraining], and both cosine similarity-based and toxicity-aware distances are computed. Detailed responses are provided in the appendix.
  • Figure 2: Probabilistic certificate $P$ under Gaussian noise with standard deviations $\sigma \in \{1.0, 5.0, 10.0\}$ for each prompt $t_h^{i}$ ($i = 1, \cdots, 10$). The x-axis represents the threshold $\epsilon$.
  • Figure 3: Probabilistic certificate $P$ under Laplacian noise with standard deviations $\sigma \in \{1.0, 5.0, 10.0\}$ for each prompt $t_h^{i}$ ($i = 1, \cdots, 10$). The x-axis represents the threshold $\epsilon$.
  • Figure :

Theorems & Definitions (6)

  • Definition 3.1: $\mathbf{r}_{N}^{t_h}$-Targeted Distance
  • Definition 3.2: Randomly Smoothed Probabilistic Certificate
  • Lemma 3.3: Neyman-Pearson Lemma [neyman1933ix] for Isotropic Gaussians
  • Lemma 3.4: Adaptively Certified $\ell_2$ Robust Radius
  • Lemma 3.5: Teng et al. [teng2020ell_1] for Isotropic Laplacians
  • Lemma 3.6: Certified $\ell_1$ Robust Radius