Interpreting Adversarial Attacks and Defences using Architectures with Enhanced Interpretability

Akshay G Rao, Chandrashekhar Lakshminarayanan, Arun Rajkumar

TL;DR

The paper addresses adversarial robustness by leveraging the interpretability of Deep Linearly Gated Networks (DLGN) to compare PGD adversarial training (PGD-AT) with standard training (STD-TR). By merging feature-layer transformations, the authors analyze hyperplane alignment, path activity, and gating patterns, revealing that PGD-AT yields hyperplanes farther from data points, more diverse active subnetworks, and reduced gate-overlap growth under attack. PCA experiments show that embedding PCA can harm PGD-AT robustness and that STD-TR aligns more with principal components, suggesting misalignment with robustness objectives. Overall, the work provides mechanistic insights into how PGD-AT reorganizes internal representations to improve robustness and suggests directions for future architecture design and adversarial training strategies.

Abstract

Adversarial attacks in deep learning represent a significant threat to the integrity and reliability of machine learning models. Adversarial training has been a popular defence technique against these adversarial attacks. In this work, we capitalize on a network architecture, namely Deep Linearly Gated Networks (DLGN), which has better interpretation capabilities than regular deep network architectures. Using this architecture, we interpret robust models trained using PGD adversarial training and compare them with standard training. Feature networks in DLGN act as feature extractors, making them the only medium through which an adversary can attack the model. We analyze the feature network of DLGN with fully connected layers with respect to properties like alignment of the hyperplanes, hyperplane relation with PCA, and sub-network overlap among classes and compare these properties between robust and standard models. We also consider this architecture having CNN layers wherein we qualitatively (using visualizations) and quantitatively contrast gating patterns between robust and standard models. We uncover insights into hyperplanes resembling principal components in PGD-AT and STD-TR models, with PGD-AT hyperplanes aligned farther from the data points. We use path activity analysis to show that PGD-AT models create diverse, non-overlapping active subnetworks across classes, preventing attack-induced gating overlaps. Our visualization ideas show the nature of representations learnt by PGD-AT and STD-TR models.

Paper Structure

This paper contains 19 sections, 8 equations, 16 figures, 11 tables, 1 algorithm.

Figures (16)

  • Figure 1: Deep Linearly Gated Networks (DLGN) architecture. $\mathrm{GALU} = x \cdot \mathrm{Gate}(x')$
  • Figure 2: Flip distribution for PGD-AT vs. STD-TR FC-DLGN-W128-D4 models. The y-axis shows the fraction of points whose gate flipped at each node; the x-axis shows the node index.
  • Figure 3: Flip distribution per hyperplane (y-axis) vs. median projection distance (x-axis) on the MNIST dataset. Each point represents one hyperplane.
  • Figure 4: Median projection distance for PGD-AT vs. STD-TR FC-DLGN-W128-D4 models. The left panel is for MNIST and the right panel is for Fashion MNIST. The y-axis shows the median projection distance of the data points to each hyperplane; the x-axis shows the node/hyperplane index.
  • Figure 5: Robust and clean accuracies of PGD-AT and STD-TR FC-DLGN-W128-D4 models under three gate-masking schemes: random masking, masking the gates with the highest median projection distance, and masking the gates with the lowest median projection distance. Dotted lines denote STD-TR models and solid lines denote PGD-AT models.
  • ...and 11 more figures
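
The quantities named in the captions above (GALU gating, gate flips, and median projection distance) can be illustrated with a small NumPy sketch. This is not the authors' code. It assumes hard 0/1 gates, takes a "flip" to mean a gate that differs between a clean input and its perturbed version, and measures projection distance as the Euclidean distance from a point to the hyperplane $w \cdot x + b = 0$. All function names and the toy data are hypothetical. The `merge_linear_layers` helper shows why feature-layer transformations can be merged: a stack of linear layers without activations collapses into a single affine map, so each gate corresponds to a hyperplane in input space.

```python
import numpy as np


def galu(x, x_gate):
    """GALU = x * Gate(x'), with a hard 0/1 gate (an assumption of this sketch)."""
    gate = (x_gate > 0).astype(x.dtype)
    return x * gate


def merge_linear_layers(W1, b1, W2, b2):
    """Collapse two stacked linear layers into one affine map on the input.

    W2 @ (W1 @ x + b1) + b2 == (W2 @ W1) @ x + (W2 @ b1 + b2)
    """
    return W2 @ W1, W2 @ b1 + b2


def gate_pattern(X, W, b):
    """Boolean gates, one column per hyperplane (row of W, entry of b)."""
    return (X @ W.T + b) > 0


def flip_fraction(X_clean, X_adv, W, b):
    """Fraction of points whose gate differs between clean and perturbed inputs."""
    return np.mean(gate_pattern(X_clean, W, b) != gate_pattern(X_adv, W, b), axis=0)


def median_projection_distance(X, W, b):
    """Median Euclidean distance from the points to each hyperplane w.x + b = 0."""
    dist = np.abs(X @ W.T + b) / np.linalg.norm(W, axis=1)
    return np.median(dist, axis=0)


# Toy data, made up for illustration only.
W = np.array([[1.0, 0.0], [0.0, 2.0]])
b = np.array([0.0, -2.0])
X = np.array([[1.0, 1.0], [-1.0, 3.0], [2.0, 0.0]])
X_adv = X + np.array([[-1.5, 0.0], [0.0, 0.0], [0.0, 0.0]])

print(flip_fraction(X, X_adv, W, b))
print(median_projection_distance(X, W, b))
```

In this toy setup only the first point crosses the first hyperplane, so one of three gates flips there and none flip at the second. A hyperplane with a larger median projection distance sits farther from the data, so a bounded perturbation is less likely to flip its gate.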