Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models

Md Zarif Hossain, Ahmed Imteaj

TL;DR

Sim-CLIP is an unsupervised adversarial fine-tuning method that makes the widely used CLIP vision encoder more robust to adversarial attacks while maintaining semantic richness and specificity.

Abstract

Vision-language models (VLMs) have made significant strides in recent years, especially in multimodal tasks, yet they remain susceptible to adversarial attacks on their vision components. To address this, we propose Sim-CLIP, an unsupervised adversarial fine-tuning method that enhances the robustness of the widely used CLIP vision encoder against such attacks while maintaining semantic richness and specificity. By employing a Siamese architecture with a cosine similarity loss, Sim-CLIP learns semantically meaningful and attack-resilient visual representations without requiring large batch sizes or momentum encoders. Our results demonstrate that VLMs equipped with Sim-CLIP's fine-tuned CLIP encoder are significantly more robust to adversarial attacks while preserving the semantic meaning of the perturbed images. Notably, Sim-CLIP does not require additional training or fine-tuning of the VLM itself; replacing the original vision encoder with our fine-tuned Sim-CLIP encoder suffices to provide robustness. This work underscores the importance of reinforcing foundational models like CLIP to safeguard the reliability of downstream VLM applications, paving the way for more secure and effective multimodal systems.
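
To make the "Siamese architecture with cosine similarity loss" idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the two branches of the Siamese setup produce one embedding for a clean image and one for its perturbed counterpart, and that the objective is to maximize their cosine similarity (equivalently, minimize its negative). The function names `cosine_similarity` and `siamese_cosine_loss` are hypothetical, and the abstract does not specify details such as symmetrization or gradient handling, so none are modeled here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def siamese_cosine_loss(clean_embedding, perturbed_embedding):
    """Negative cosine similarity between the two branch outputs.

    Assumption (not stated in the abstract): the loss is simply the
    negated similarity, so it is lowest when the clean and perturbed
    embeddings point in the same direction.
    """
    return -cosine_similarity(clean_embedding, perturbed_embedding)


# Toy embeddings standing in for encoder outputs (illustrative values only).
clean = [1.0, 2.0, 3.0]
perturbed = [1.1, 1.9, 3.2]
loss = siamese_cosine_loss(clean, perturbed)
```

Because the loss compares one clean/perturbed pair at a time, it needs no negative samples, which is consistent with the abstract's claim that large batch sizes are not required.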

Paper Structure (22 sections, 7 equations, 3 figures, 3 tables)

This paper contains 22 sections, 7 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Targeted $\ell_\infty$ attack with radius $\epsilon=2/255$ using the original CLIP model as the vision encoder in LLaVA. Original images with their captions on the left and captions generated by LLaVA for benign images on the right.
  • Figure 2: Workflow and overview of the proposed Sim-CLIP. CLIP undergoes adversarial training using our proposed Sim-CLIP approach and serves as the vision encoder in the Vision-Language Model. The workflow includes image perturbation for inference and pre-training CLIP on ImageNet using Sim-CLIP for enhanced robustness. The CLIP model undergoes adversarial fine-tuning while the text encoder is kept frozen.
  • Figure 3: Targeted $\ell_\infty$ attacks with radius $\epsilon=4/255$ using the original and robust CLIP models as the vision encoder in LLaVA. Using the target strings from Table \ref{tab:targetedquant}, we present generated captions (good caption, captions with mistakes, captions missing intricate details, malicious target output) on original (left) and imperceptible adversarial (right) images.
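
The $\ell_\infty$ radii quoted in the captions above bound how far any single pixel may move. As a rough illustration only, the sketch below projects a candidate adversarial image back into the $\epsilon$-ball around the clean image. It assumes pixel intensities are scaled to $[0, 1]$ and stored as a flat list, and the helper name `project_linf` is hypothetical; the captions do not describe the attack procedure itself.

```python
def project_linf(clean_pixels, adversarial_pixels, epsilon=4 / 255):
    """Clip each pixel's change to [-epsilon, epsilon], then to [0, 1].

    Assumption: pixels are floats in [0, 1]; epsilon defaults to the
    4/255 radius mentioned in the Figure 3 caption.
    """
    projected = []
    for x, x_adv in zip(clean_pixels, adversarial_pixels):
        # Limit the per-pixel perturbation to the l_inf budget.
        delta = max(-epsilon, min(epsilon, x_adv - x))
        # Keep the result a valid pixel intensity.
        projected.append(max(0.0, min(1.0, x + delta)))
    return projected


# A large change on the first pixel is cut back to the budget;
# a small change on the second pixel is left as is.
projected = project_linf([0.5, 0.5], [0.9, 0.505])
```

At $\epsilon=4/255$ each pixel changes by at most about 1.6% of the intensity range, which is why the captions describe the adversarial images as imperceptible.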