Towards Predicting the Success of Transfer-based Attacks by Quantifying Shared Feature Representations

Ashley S. Dale, Mei Qiu, Foo Bin Che, Thomas Bsaibes, Lauren Christopher, Paul Salama

TL;DR

The paper addresses the prediction of transfer-based attack (TBA) success on black-box vision models without access to gradients, weights, or attack specifics. It introduces a cross-manifold embedding framework that projects surrogate and target feature vectors into a shared low-dimensional space using dimensionality reduction (e.g., UMAP) and quantifies alignment with the normalized symmetric Hausdorff distance $H$. Across SI-Score, Fashion-MNIST, and NWPU-RESISC45 and multiple CNN backbones, the study finds a moderate negative correlation (approximately $\rho = -0.56$ to $-0.57$) between embedding distance $H$ and transfer success $AA(\

Abstract

Much effort has been made to explain and improve the success of transfer-based attacks (TBA) on black-box computer vision models. This work provides the first attempt at a priori prediction of attack success by identifying the presence of vulnerable features within target models. Recent work by Chen and Liu (2024) proposed the manifold attack model, a unifying framework proposing that successful TBA exist in a common manifold space. Our work experimentally tests the common manifold space hypothesis with a new methodology: first, projecting feature vectors from surrogate and target feature extractors trained on ImageNet onto the same low-dimensional manifold; second, quantifying any observed structural similarities on the manifold; and finally, relating these observed similarities to the success of the TBA. We find that shared feature representation moderately correlates with increased success of TBA ($\rho = 0.56$). This method may be used to predict whether an attack will transfer without knowledge of the model weights, training, architecture, or details of the attack. The results confirm the presence of shared feature representations between two feature extractors of different sizes and complexities, and demonstrate the utility of datasets from different target domains as test signals for interpreting black-box feature representations.
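
The second step of the methodology, quantifying how closely two embeddings overlap, can be illustrated with a minimal pure-Python sketch of a symmetric Hausdorff distance between two point sets. The function names are hypothetical, and the normalization (dividing by the diagonal of the joint bounding box) is an assumption made for illustration only; the text above does not state how the authors normalize $H$.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return max(min(math.dist(p, q) for q in b) for p in a)


def symmetric_hausdorff(a, b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def normalized_symmetric_hausdorff(a, b):
    """Symmetric Hausdorff distance scaled into [0, 1].

    ASSUMPTION: the scale is the diagonal of the bounding box enclosing
    both point sets. This choice is illustrative, not taken from the paper.
    """
    points = list(a) + list(b)
    dims = range(len(points[0]))
    lower = [min(p[d] for p in points) for d in dims]
    upper = [max(p[d] for p in points) for d in dims]
    diagonal = math.dist(lower, upper)
    if diagonal == 0.0:
        return 0.0  # all points coincide, so the sets overlap exactly
    return symmetric_hausdorff(a, b) / diagonal
```

Here `a` and `b` would hold the low-dimensional coordinates of the surrogate and cross-projected target embeddings, for example lists of `(x, y)` tuples. A value near 0 indicates heavily overlapping embeddings, and a larger value indicates less overlap.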

Paper Structure

This paper contains 32 sections, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Summary of the methodology used to identify shared low-dimensional manifolds. A surrogate model and a target model are chosen, then used to create feature vector representations of the same dataset. Dimensionality reduction via the UMAP algorithm enables the cross-projection of feature vectors from the target model onto the manifold of the surrogate model. Results are shown for the SI-Score dataset (Djolonga et al., 2020), with ResNetv2 as the surrogate model and MobileNetv3 as the target model.
  • Figure 2: Analysis of manifold embeddings for Fashion-MNIST and NWPU-RESISC45 ("RESISC"). (a), (d) The surrogate model embedding, generated using feature vectors $X$ from ResNetv2 and embedded by UMAP as $f(X)_{UMAP}$. (b), (e) The target model embedding, generated using feature vectors $X'$ from MobileNetv3 and embedded by UMAP as $g(X')_{UMAP}$. (c), (f) The cross-manifold embedding, generated by applying the transformation $f$ to $X'$ and plotting $f(X')$ together with $f(X)$. The Hausdorff distance $H$ between the two embeddings is shown, where a larger value of $H$ implies that the embeddings overlap less.
  • Figure 3: Comparison of FGSM attacks generated by the ResNetv2 surrogate model against itself (solid line) and against the target MobileNet v2 model (dashed line) for the source datasets SI-Score (red), Fashion-MNIST (blue), and RESISC (green). (Left) Model performance as a function of attack strength $\epsilon$. (Right) Model performance as a function of the perturbation of the input images, evaluated by the average SSIM of the attacked images. Error bars are not shown, as the standard deviation is on the order of $10^{-3}$ to $10^{-6}$.
  • Figure 4: The success of TBA, as quantified by the average accuracy $AA$ at attack strength $\epsilon=0.03$, plotted against the normalized symmetric Hausdorff distance between the target and surrogate data embeddings on a shared low-dimensional manifold. An inverse correlation is observed, where successful attacks (high $AA$) correlate with small Hausdorff distances. The correlation coefficient $\rho_{H,AA}$ is $-0.56$, indicating a moderate negative correlation between the distance $H$ between embeddings and the average accuracy $AA$ of the model. A PCA of these data yields eigenvalues of $\left[ 7.13, 0.00\right]$, indicating that the data variance can be explained by a single direction.
  • Figure 5: Repeating the UMAP embedding process with new hyperparameters, $k=5$ and distance $=1\times10^{-10}$, for the RESISC dataset and the ResNetv2 surrogate embedding (shown at far right). The target data continues to have neighbors on the ResNet embedding, although the total area covered by the target dataset is diminished.
  • ...and 7 more figures
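
The relationship summarized in the Figure 4 caption, a correlation coefficient between embedding distance $H$ and average accuracy $AA$, can be sketched as follows. This assumes $\rho$ denotes Pearson's coefficient, which the captions do not state. The function name and the sample values are hypothetical placeholders, not results from the paper.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two root sums of squares.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# PLACEHOLDER values for illustration only; these are not data from the paper.
# One entry per (surrogate, target, dataset) combination.
hausdorff_distances = [0.1, 0.2, 0.3, 0.4]
average_accuracies = [0.9, 0.7, 0.8, 0.4]

rho = pearson_correlation(hausdorff_distances, average_accuracies)
```

With real measurements, a negative `rho` would correspond to the inverse relationship described in the Figure 4 caption, where larger distances between embeddings accompany lower $AA$.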