As Firm As Their Foundations: Can open-sourced foundation models be used to create adversarial examples for downstream tasks?

Anjun Hu, Jindong Gu, Francesco Pinto, Konstantinos Kamnitsas, Philip Torr

TL;DR

This work investigates whether open-source foundation models like CLIP propagate adversarial vulnerabilities to downstream vision-language tasks. It introduces Patch Representation Misalignment (PRM), a cross-task attack that perturbs the input to distort intermediate CLIP representations by minimising a patch-wise cosine-similarity objective, formalized as $L_{PRM} = \sum_{l\in L} \sum_{p=0}^{\lceil HW/d^2 \rceil} \frac{f^p_l \cdot f'^p_l}{\|f^p_l\| \|f'^p_l\|}$, where $f^p_l$ and $f'^p_l$ denote the clean and adversarial representations of patch $p$ at layer $l$. Using only publicly available CLIP vision encoders as surrogates, PRM transfers substantially to more than 20 downstream models across four tasks (OVS, OVD, IC, VQA), outperforming task-specific and cross-task baselines. These results reveal a significant safety risk: vulnerabilities of a foundation model can propagate to diverse downstream systems, underscoring the need for defense strategies and robust training approaches when open-source foundation models are deployed.
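To make the objective concrete, here is a minimal sketch of how $L_{PRM}$ could be evaluated, assuming the per-layer patch features have already been extracted from the surrogate encoder. The function names (`cosine_similarity`, `prm_loss`), the nested-list feature layout, and the `eps` guard against zero norms are illustrative assumptions, not the authors' implementation. An attacker would minimise this quantity with respect to the input perturbation.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity of two equal-length vectors (eps is an assumed guard)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def prm_loss(clean_feats, adv_feats):
    """Sum patch-wise cosine similarities over all layers and patches.

    clean_feats / adv_feats: list over layers; each layer is a list of
    patch feature vectors (hypothetical layout chosen for illustration).
    """
    total = 0.0
    for layer_clean, layer_adv in zip(clean_feats, adv_feats):
        for f, f_adv in zip(layer_clean, layer_adv):
            total += cosine_similarity(f, f_adv)
    return total


# Toy data: one layer with two 2-dimensional patch features.
clean = [[[1.0, 0.0], [0.0, 2.0]]]
flipped = [[[-1.0, 0.0], [0.0, -2.0]]]      # every patch points the opposite way
orthogonal = [[[0.0, 3.0], [5.0, 0.0]]]     # every patch is orthogonal to the clean one

loss_identical = prm_loss(clean, clean)      # maximal: one per patch
loss_flipped = prm_loss(clean, flipped)      # minimal: minus one per patch
loss_orthogonal = prm_loss(clean, orthogonal)
```

With identical features the loss equals the number of patches summed over layers, and it falls as each adversarial patch representation is rotated away from its clean counterpart, which is the direction the attack pushes.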

Abstract

Foundation models pre-trained on web-scale vision-language data, such as CLIP, are widely used as cornerstones of powerful machine learning systems. While pre-training offers clear advantages for downstream learning, it also endows downstream models with shared adversarial vulnerabilities that can be easily identified through the open-sourced foundation model. In this work, we expose such vulnerabilities in CLIP's downstream models and show that foundation models can serve as a basis for attacking their downstream systems. In particular, we propose a simple yet effective adversarial attack strategy termed Patch Representation Misalignment (PRM). Solely based on open-sourced CLIP vision encoders, this method produces adversaries that simultaneously fool more than 20 downstream models spanning 4 common vision-language tasks (semantic segmentation, object detection, image captioning and visual question-answering). Our findings highlight the concerning safety risks introduced by the extensive usage of public foundational models in the development of downstream systems, calling for extra caution in these scenarios.

Paper Structure (39 sections, 1 equation, 6 figures, 9 tables, 1 algorithm)

Figures (6)

  • Figure 1: Given a clean image (top left), attackers can leverage the open-sourced CLIP vision encoder to find imperceptible input perturbations (bottom left, magnified by 30$\times$ for visualisation) that distort CLIP's intermediate features. These perturbations are added to the original image to construct an adversarial sample that can simultaneously fool many downstream models intended for various tasks: downstream models that are highly performant on clean samples (top row) suffer significant performance degradation (bottom row) under such attacks.
  • Figure 2: Overview of our attack pipeline. A normal forward pass with clean input is marked in green whereas the forward pass of the adversarial sample is marked in red. Dashed line indicates the flow of loss gradients which are used to update the injected adversarial perturbation. The loss objective minimises the cosine similarity between the adversarial representation of each patch (token) $f'$ and its clean counterpart $f$ along the embedding (ViT) or channel (CNN) dimension of the features. This approach individually diverts each patch representation (indicated by the reversed intensity of the top-left patch representation) to induce semantic distortions in all image regions.
  • Figure 3: Normalised target model performance (model performance metrics under adversarial attacks divided by metrics on clean samples) of various attack strategies. Left: using ViT-B/16 (or task-specific baselines that use ViT-B/16 backbone) as surrogates. Right: using ConvNeXt-L as surrogates. PRM (red line) outperforms baseline methods by a significant margin across all tasks with both surrogate choices. The radius of the outer circles represents model performance on clean samples (unitary normalised metrics). Each attack strategy corresponds to a line. Each task is indicated by a differently coloured sector (datasets and metrics used for each task are detailed in the $2^\mathrm{nd}$ line of the legend). Target model names are annotated on the periphery of the circles. White-box scenarios in surrogate loss maximisation baselines are excluded.
  • Figure 4: An adversarial example created via PRM (using ViT-B/16 CLIP vision encoder as a surrogate) can fool various downstream models across various tasks. Clean ($x$) and adversarial ($x'$) inputs are shown in the top-left corner. Correct predictions are marked with green frames whereas adversarial predictions are marked with red. Note that downstream models tend to make semantically consistent mistakes (i.e. perceiving a false positive human in the scene).
  • Figure 5: Another adversarial example created with the PRM method using the ViT-B/16 CLIP vision encoder as a surrogate. Clean ($x$) and adversarial ($x'$) inputs are shown in the top-left corner. Correct predictions are marked with green frames whereas adversarial predictions are marked with red. Curiously, as exemplified by this sample, OVD target models frequently make false positive predictions of persons, which is likely due to biases in the training data (MS COCO; Lin et al., 2014), where person is the highest-frequency class.
  • ...and 1 more figure