A Framework for Evaluating Vision-Language Model Safety: Building Trust in AI for Public Sector Applications
Maisha Binte Rashid, Pablo Rivas
TL;DR
The paper addresses the need for robust safety evaluation of vision-language models in public sector applications by proposing a unified vulnerability framework. It analyzes model robustness under Gaussian, salt-and-pepper, and uniform noise, derives composite noise patches and saliency-based perturbations, and benchmarks them against the FGSM adversarial attack. A key contribution is the Vulnerability Score, defined as $VS = w_1 \cdot \text{Noise Impact Score} + w_2 \cdot \text{FGSM Impact Score}$ with $w_1 + w_2 = 1$. Each component score is the relative accuracy drop from the baseline, expressed as a percentage: $\text{Noise Impact Score} = \frac{\text{Acc}_{\text{Baseline}} - \text{Acc}_{\text{Noise}}}{\text{Acc}_{\text{Baseline}}} \times 100$ and $\text{FGSM Impact Score} = \frac{\text{Acc}_{\text{Baseline}} - \text{Acc}_{\text{FGSM}}}{\text{Acc}_{\text{Baseline}}} \times 100$. The approach is demonstrated on a CLIP-based system using 300 images from the Caltech-256 dataset, revealing substantial vulnerability to both random and adversarial perturbations and showing the utility of the proposed metric for public-sector decision-making. Overall, the framework offers a practical, scalable tool to quantify and compare the robustness of multimodal models, aiding governance and deployment of trustworthy AI in mission-critical domains.
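The Vulnerability Score formulas above are simple enough to sketch directly. The following is a minimal illustration, not the authors' implementation: the function names `impact_score` and `vulnerability_score` are hypothetical, and the default weight `w1 = 0.5` is an assumption, since the summary only states that $w_1 + w_2 = 1$.

```python
def impact_score(acc_baseline: float, acc_perturbed: float) -> float:
    """Relative accuracy drop from the baseline, in percent.

    Matches the form of the Noise Impact Score and FGSM Impact Score:
    (Acc_Baseline - Acc_Perturbed) / Acc_Baseline * 100.
    """
    if acc_baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (acc_baseline - acc_perturbed) / acc_baseline * 100.0


def vulnerability_score(
    acc_baseline: float,
    acc_noise: float,
    acc_fgsm: float,
    w1: float = 0.5,  # assumed default; the summary only requires w1 + w2 = 1
) -> float:
    """VS = w1 * Noise Impact Score + w2 * FGSM Impact Score, with w2 = 1 - w1."""
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie in [0, 1] so that w1 + w2 = 1")
    w2 = 1.0 - w1
    noise_impact = impact_score(acc_baseline, acc_noise)
    fgsm_impact = impact_score(acc_baseline, acc_fgsm)
    return w1 * noise_impact + w2 * fgsm_impact


if __name__ == "__main__":
    # Made-up accuracies for illustration only; these are not results from the paper.
    print(vulnerability_score(acc_baseline=0.90, acc_noise=0.60, acc_fgsm=0.30))
```

With these made-up accuracies, the two impact scores are roughly 33.3 and 66.7, so equal weights give a Vulnerability Score of about 50. Shifting `w1` toward 0 emphasizes adversarial risk, while shifting it toward 1 emphasizes sensitivity to random noise.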
Abstract
Vision-Language Models (VLMs) are increasingly deployed in public sector missions, necessitating robust evaluation of their safety and vulnerability to adversarial attacks. This paper introduces a novel framework to quantify adversarial risks in VLMs. We analyze model performance under Gaussian, salt-and-pepper, and uniform noise, identifying misclassification thresholds and deriving composite noise patches and saliency patterns that highlight vulnerable regions. These patterns are compared against the Fast Gradient Sign Method (FGSM) to assess their adversarial effectiveness. We propose a new Vulnerability Score that combines the impact of random noise and adversarial attacks, providing a comprehensive metric for evaluating model robustness.
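The three random noise families named in the abstract can be illustrated with a short NumPy sketch. It is an assumption-laden example rather than the paper's procedure: the function names, the parameters `sigma`, `amount`, and `width`, and the convention that images are float arrays scaled to [0, 1] are all hypothetical choices, since the abstract does not specify noise parameters.

```python
import numpy as np


def add_gaussian_noise(image: np.ndarray, sigma: float, rng: np.random.Generator) -> np.ndarray:
    """Add zero-mean Gaussian noise with standard deviation sigma, then clip to [0, 1]."""
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


def add_salt_and_pepper_noise(image: np.ndarray, amount: float, rng: np.random.Generator) -> np.ndarray:
    """Set roughly `amount` of the pixels to 0 (pepper) or 1 (salt), split evenly."""
    noisy = image.copy()
    # One random draw per pixel location, shared across channels if present.
    mask = rng.random(image.shape[:2])
    noisy[mask < amount / 2] = 0.0        # pepper
    noisy[mask > 1.0 - amount / 2] = 1.0  # salt
    return noisy


def add_uniform_noise(image: np.ndarray, width: float, rng: np.random.Generator) -> np.ndarray:
    """Add noise drawn uniformly from [-width, width), then clip to [0, 1]."""
    noisy = image + rng.uniform(-width, width, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((224, 224, 3))  # stand-in for a preprocessed input image
    for name, noisy in [
        ("gaussian", add_gaussian_noise(image, sigma=0.1, rng=rng)),
        ("salt-and-pepper", add_salt_and_pepper_noise(image, amount=0.05, rng=rng)),
        ("uniform", add_uniform_noise(image, width=0.1, rng=rng)),
    ]:
        print(name, float(np.abs(noisy - image).mean()))
```

Sweeping `sigma`, `amount`, or `width` upward and recording the first value at which the model's prediction changes is one plausible way to locate the misclassification thresholds the abstract mentions.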
