Investigating and unmasking feature-level vulnerabilities of CNNs to adversarial perturbations

Davide Coppola; Hwee Kuan Lee

TL;DR

This paper investigates why CNNs are vulnerable to adversarial perturbations by shifting the focus to feature-map representations in shallow layers. It introduces the Adversarial Intervention framework to test causally how perturbing selected channels affects predictions, quantified through metrics such as $AEL_{\Phi}$ and $AEA_{\Phi}$. Across MNIST-37, CIFAR-10, and Imagenette, with Auto-PGD and other attacks, it shows that a small set of first-layer channels can dominate vulnerability, that channel rankings are largely consistent across attacks, and that vulnerability correlates positively with the kernels' $\|\cdot\|_2$ norms. The work provides a diagnostic, causality-based foundation for future targeted defenses and a deeper understanding of adversarial perturbations at the feature-map level.

Abstract

This study explores the impact of adversarial perturbations on Convolutional Neural Networks (CNNs) with the aim of enhancing the understanding of their underlying mechanisms. Despite numerous defense methods proposed in the literature, there is still an incomplete understanding of this phenomenon. Instead of treating the entire model as vulnerable, we propose that specific feature maps learned during training contribute to the overall vulnerability. To investigate how the hidden representations learned by a CNN affect its vulnerability, we introduce the Adversarial Intervention framework. Experiments were conducted on models trained on three well-known computer vision datasets, subjecting them to attacks of different nature. Our focus centers on the effects that adversarial perturbations to a model's initial layer have on the overall behavior of the model. Empirical results revealed compelling insights: a) perturbing selected channel combinations in shallow layers causes significant disruptions; b) the channel combinations most responsible for the disruptions are common among different types of attacks; c) despite shared vulnerable combinations of channels, different attacks affect hidden representations with varying magnitudes; d) there exists a positive correlation between a kernel's magnitude and its vulnerability. In conclusion, this work introduces a novel framework to study the vulnerability of a CNN model to adversarial perturbations, revealing insights that contribute to a deeper understanding of the phenomenon. The identified properties pave the way for the development of efficient ad-hoc defense mechanisms in future applications.
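Finding (d) relates a kernel's magnitude to its vulnerability. The following is a minimal sketch of how such a check might be set up, not the authors' procedure. It assumes first-layer weights stored as an array of shape `(out_channels, in_channels, k_h, k_w)`, a per-channel vulnerability score supplied by the reader, and Pearson correlation as the statistic; the function names `kernel_l2_norms` and `pearson` are hypothetical.

```python
import numpy as np

def kernel_l2_norms(weights):
    """Return the L2 norm of each output-channel kernel.

    weights: array of shape (out_channels, in_channels, k_h, k_w).
    The layout is an assumption; adapt the reshape for other layouts.
    """
    w = np.asarray(weights, dtype=float)
    flat = w.reshape(w.shape[0], -1)       # one row per output channel
    return np.sqrt((flat ** 2).sum(axis=1))

def pearson(a, b):
    """Pearson correlation between two 1-D score vectors (assumed statistic)."""
    return float(np.corrcoef(a, b)[0, 1])

# Toy usage with made-up numbers: two kernels and two vulnerability scores.
toy_weights = np.array([[[[3.0, 4.0]]], [[[0.0, 0.0]]]])
toy_norms = kernel_l2_norms(toy_weights)
```

A positive value of `pearson(norms, vulnerability_scores)` would be in line with the reported trend; the toy weights above only illustrate the shapes involved.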

Paper Structure (13 sections, 9 equations, 15 figures, 1 table)

Introduction
Related Works
Methods
The Adversarial Intervention framework
Metrics
Results
Experimental Setup
Analyzing the effect of feature-level adversarial perturbations
Inspecting the effect of attacks with different nature
What makes a channel more vulnerable than others?
Conclusions
Statistical significance
Additional Plots

Figures (15)

Figure 1: Schematic of the Adversarial Intervention framework. A CNN model is given, which can be expressed as the composition of two consecutive functions $g_A$ and $g_B$. 1. Given a clean sample $x$, the intermediate feature representation $h=g_A(x)$ and the output logits $z=g_B(g_A(x))$ are computed. 2. The adversarial example $x'=x+\eta$ is computed using an arbitrary adversarial attack algorithm from the literature. Subsequently, the corrupted intermediate feature representation $h'=g_A(x')$ is computed. 3. The Adversarial Intervention operation consists of swapping $\gamma$ arbitrary channels in the clean representation $h$ with their equivalents in the corrupted representation $h'$; this is achieved through a masking operation. The resulting set of features $h^\Phi$ is then used to compute the output logits $z^\Phi=g_B(h^\Phi)$. 4. The effect of the intervention is evaluated by computing various metrics.
Figure 2: Results of Adversarial Intervention (Auto-PGD) on the CIFAR-10 model for $\gamma=3$. (a) Top and bottom 10 channel combinations, ranked by $AEL_\Phi$. Certain combinations $\Phi$ can disrupt the model performance almost completely, while others barely affect it. (b) Channel-wise effects on logits and accuracy. The effect of certain channels differs significantly from that of others.
Figure 3: Results of Adversarial Intervention (Auto-PGD) on Imagenette for $\gamma=4$. (a) Top and bottom 10 channel combinations, ranked by $AEL_\Phi$. (b) Channel-wise effects on logits and accuracy. The plots show a clear distinction in the model's response to top- and bottom-ranking combinations, and highlight specific channels that significantly disrupt the model's output.
Figure 4: Channel ranking over values of $\gamma$ for Adversarial Intervention with Auto-PGD. In all three models, the ranking of a channel in terms of its average effect on the model output is relatively stable as the value of $\gamma$ increases.
Figure 5: Comparison of the effects on logits of individual channel combinations for attacks of different natures on the same ResNet20 model trained on CIFAR-10. The values on the axes indicate the Average Effect on Logits (AEL) when a given channel combination is perturbed through Adversarial Intervention with the attack indicated on the axis label. Each point corresponds to a channel combination. The results in these plots are for combinations of $\gamma=3$ channels.
...and 10 more figures
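The masking step described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: features are assumed to be arrays of shape `(batch, channels, height, width)`, and the function name `adversarial_intervention` and the argument `phi` (the set of $\gamma$ channel indices) are hypothetical.

```python
import numpy as np

def adversarial_intervention(h, h_adv, phi):
    """Swap the channels in `phi` of the clean features `h` with the
    matching channels of the corrupted features `h_adv`.

    h, h_adv: arrays of shape (batch, channels, height, width) (assumed layout).
    phi: iterable of channel indices; gamma = len(phi).
    """
    h = np.asarray(h)
    h_adv = np.asarray(h_adv)
    if h.shape != h_adv.shape:
        raise ValueError("clean and corrupted features must share a shape")
    # Binary mask over the channel axis: 1 for swapped channels, 0 elsewhere.
    mask = np.zeros((1, h.shape[1], 1, 1), dtype=h.dtype)
    mask[0, list(phi), 0, 0] = 1
    # h^Phi = (1 - m) * h + m * h'
    return (1 - mask) * h + mask * h_adv

# Toy usage: clean features are all zeros, corrupted features all ones.
h_clean = np.zeros((2, 4, 3, 3))
h_corrupt = np.ones((2, 4, 3, 3))
h_phi = adversarial_intervention(h_clean, h_corrupt, phi=(1, 3))
```

In the toy call, channels 1 and 3 of `h_phi` come from the corrupted features and channels 0 and 2 stay clean; feeding `h_phi` to $g_B$ would then give $z^\Phi$.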

Theorems & Definitions (4)

Definition 1: $AEL_{\Phi}$: Average Effect on Logits of $\Phi$
Definition 2: $AEL_{j}^{\gamma}$: Average Effect on Logits of channel $j$
Definition 3: $AEA_{\Phi}$: Average Effect on Accuracy of $\Phi$
Definition 4: $AEA_{j}^{\gamma}$: Average Effect on Accuracy of channel $j$
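The definitions above are listed by name only, so the sketch below rests on assumptions rather than on the paper's formulas: $AEL_{\Phi}$ is taken as the mean L2 distance between clean logits $z$ and intervened logits $z^\Phi$, $AEA_{\Phi}$ as the drop in accuracy after the intervention, and the channel-level variants as the average of a combination-level score over all combinations of size $\gamma$ that contain channel $j$. All function names are hypothetical.

```python
import numpy as np

def ael_phi(z_clean, z_phi):
    """Assumed AEL_Phi: mean L2 distance between clean and intervened logits."""
    diff = np.asarray(z_clean, dtype=float) - np.asarray(z_phi, dtype=float)
    return float(np.mean(np.linalg.norm(diff, axis=1)))

def aea_phi(z_clean, z_phi, labels):
    """Assumed AEA_Phi: clean accuracy minus accuracy after the intervention."""
    labels = np.asarray(labels)
    acc_clean = np.mean(np.argmax(z_clean, axis=1) == labels)
    acc_phi = np.mean(np.argmax(z_phi, axis=1) == labels)
    return float(acc_clean - acc_phi)

def channel_average(effect_by_phi, j):
    """Assumed channel-level score: mean effect over combinations containing j.

    effect_by_phi: dict mapping a tuple of channel indices to its score.
    """
    values = [v for phi, v in effect_by_phi.items() if j in phi]
    return float(np.mean(values))

# Toy usage with made-up logits for two samples and two classes.
z = np.array([[2.0, 0.0], [0.0, 2.0]])
z_int = np.array([[2.0, 0.0], [2.0, 0.0]])
toy_ael = ael_phi(z, z_int)
toy_aea = aea_phi(z, z_int, labels=[0, 1])
```

Under these assumed definitions, a larger value means the swapped channels disturb the output more, which matches how the figure captions rank combinations by $AEL_\Phi$.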
