MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

Samar Fares, Klea Ziu, Toluwani Aremu, Nikita Durasov, Martin Takáč, Pascal Fua, Karthik Nandakumar, Ivan Laptev

TL;DR

This work proposes a simple approach for detecting adversarial samples in VLMs: a Text-to-Image (T2I) model regenerates an image from the caption produced by the target VLM, and the regenerated image is compared with the input. The method also extends to classification tasks, showcasing its adaptability and model-agnostic nature.

Abstract

Vision-Language Models (VLMs) are becoming increasingly vulnerable to adversarial attacks as various novel attack strategies are being proposed against these models. While existing defenses excel in unimodal contexts, they currently fall short in safeguarding VLMs against adversarial threats. To mitigate this vulnerability, we propose a novel, yet elegantly simple approach for detecting adversarial samples in VLMs. Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs. Subsequently, we calculate the similarities of the embeddings of both input and generated images in the feature space to identify adversarial samples. Empirical evaluations conducted on different datasets validate the efficacy of our approach, outperforming baseline methods adapted from image classification domains. Furthermore, we extend our methodology to classification tasks, showcasing its adaptability and model-agnostic nature. Theoretical analyses and empirical findings also show the resilience of our approach against adaptive attacks, positioning it as an excellent defense mechanism for real-world deployment against adversarial threats.

Paper Structure

This paper contains 27 sections, 6 equations, 7 figures, 28 tables, 1 algorithm.

Figures (7)

  • Figure 1: MirrorCheck approach. At inference time, to check whether an input image has been adversarially attacked, our framework (1) generates a text description (caption) of the image, (2) uses this caption to regenerate the image with a text-to-image model, and (3) extracts and compares embeddings of the original and regenerated images using a feature extractor. If the embeddings differ significantly, the original image was likely attacked. The intuition is that, for an attacked input, the image and its caption are not semantically consistent, so using the predicted caption as a prompt for image generation yields an image that differs significantly in semantics from the input.
  • Figure 2: An example using our MirrorCheck framework. For both Clean and adversarial (Adv) cases, we use the BLIP model to generate captions for the given images. Stable Diffusion then generates images based on these captions. For the clean image, different image encoders show high similarity between the input image and the generated one. Conversely, when the input image is adversarial, different image encoders show low similarity.
  • Figure 3: Effect of our ensemble approach on a victim model (Case study: UniDiffuser).
  • Figure 4: Visual results using BLIP (Victim Model) and Stable Diffusion (T2I Model). On the left are the images generated using the adversarial images+texts and on the right are the images generated using the clean images+texts.
  • Figure 5: Ablations of MirrorCheck in which the baseline T2I model (Stable Diffusion) is replaced with UniDiffuser (UD) and ControlNet (CN). We compare our detection accuracies with the baselines Feature Squeezing (FS), MagNet (MN), and PuVAE (PV). Detailed results can be found in the appendices. Key Takeaway: Across different T2I models, MirrorCheck consistently surpasses all baseline methods.
  • ...and 2 more figures
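
The three-step procedure described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`mirror_check`, `caption_fn`, `generate_fn`, `embed_fn`), the use of cosine similarity, and the threshold of 0.5 are all assumptions made for the example. In practice `caption_fn` would wrap the victim VLM (for example BLIP), `generate_fn` a T2I model (for example Stable Diffusion), and `embed_fn` an image encoder; here they are replaced by toy stand-ins so that the sketch runs without any model weights.

```python
import math
from typing import Any, Callable, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mirror_check(
    image: Any,
    caption_fn: Callable[[Any], str],
    generate_fn: Callable[[str], Any],
    embed_fn: Callable[[Any], Sequence[float]],
    threshold: float = 0.5,  # hypothetical value; would be tuned on held-out data
) -> Tuple[bool, float]:
    """Return (is_flagged, similarity) for one input image."""
    caption = caption_fn(image)          # step 1: caption from the target VLM
    regenerated = generate_fn(caption)   # step 2: regenerate an image from the caption
    similarity = cosine_similarity(      # step 3: compare embeddings
        embed_fn(image), embed_fn(regenerated)
    )
    return similarity < threshold, similarity


# --- Toy stand-ins (assumptions, for illustration only) ---------------------
# A toy "image" records what it actually shows and what the victim VLM reads.
TOY_EMBEDDINGS = {"dog": [1.0, 0.0], "car": [0.0, 1.0]}


def toy_caption(image: dict) -> str:
    return image["vlm_reads"]


def toy_generate(caption: str) -> dict:
    return {"content": caption, "vlm_reads": caption}


def toy_embed(image: dict) -> Sequence[float]:
    return TOY_EMBEDDINGS[image["content"]]


clean_image = {"content": "dog", "vlm_reads": "dog"}
adv_image = {"content": "dog", "vlm_reads": "car"}  # attack changes the caption

clean_flag, clean_sim = mirror_check(clean_image, toy_caption, toy_generate, toy_embed)
adv_flag, adv_sim = mirror_check(adv_image, toy_caption, toy_generate, toy_embed)
print(clean_flag, clean_sim)
print(adv_flag, adv_sim)
```

Passing the three models as callables keeps the sketch agnostic to the choice of VLM, T2I model, and encoder, which mirrors the model-agnostic claim in the abstract. The decision rule itself (a single similarity threshold) is only one plausible choice.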