Interpreting Adversarial Attacks and Defences using Architectures with Enhanced Interpretability
Akshay G Rao, Chandrashekhar Lakshminarayanan, Arun Rajkumar
TL;DR
The paper studies adversarial robustness by using the interpretability of Deep Linearly Gated Networks (DLGN) to compare models trained with PGD adversarial training (PGD-AT) against models trained with standard training (STD-TR). By merging the feature-layer transformations, the authors analyze hyperplane alignment, path activity, and gating patterns. They find that PGD-AT yields hyperplanes that lie farther from the data points, more diverse active subnetworks, and less growth in gate overlap under attack. PCA experiments show that embedding PCA can harm PGD-AT robustness and that STD-TR hyperplanes align more closely with principal components, suggesting that alignment with principal components is at odds with robustness objectives. Overall, the work offers mechanistic insight into how PGD-AT reorganizes internal representations to improve robustness, and it suggests directions for future architecture design and adversarial training strategies.
Abstract
Adversarial attacks in deep learning represent a significant threat to the integrity and reliability of machine learning models. Adversarial training has been a popular defence technique against these attacks. In this work, we capitalize on a network architecture, namely Deep Linearly Gated Networks (DLGN), which is more interpretable than regular deep network architectures. Using this architecture, we interpret robust models trained using PGD adversarial training (PGD-AT) and compare them with models obtained by standard training (STD-TR). Feature networks in DLGN act as feature extractors, making them the only medium through which an adversary can attack the model. We analyze the feature network of a DLGN with fully connected layers with respect to properties such as the alignment of the hyperplanes, the relation of the hyperplanes to PCA, and sub-network overlap among classes, and we compare these properties between robust and standard models. We also consider a variant of this architecture with CNN layers, for which we contrast gating patterns between robust and standard models both qualitatively (using visualizations) and quantitatively. We uncover insights into hyperplanes resembling principal components in PGD-AT and STD-TR models, with PGD-AT hyperplanes lying farther from the data points. We use path activity analysis to show that PGD-AT models create diverse, non-overlapping active subnetworks across classes, preventing attack-induced gating overlaps. Our visualization ideas show the nature of the representations learnt by PGD-AT and STD-TR models.
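
As a rough illustration of the quantities mentioned above, the following Python sketch assumes the DLGN formulation from prior work: a purely linear feature network whose pre-activations drive soft gates, and a value network with a constant input whose layer outputs are multiplied by those gates. Biases are omitted, and every function name, the gate sharpness `beta`, and the toy weights are hypothetical choices made for illustration; this is not the authors' code.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(a, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def merged_hyperplanes(feature_weights):
    """Collapse the linear feature layers W_1..W_L into input space.

    Because the feature network is assumed linear (no biases here), the
    pre-activations of layer l equal (W_l ... W_1) x. Row i of the l-th
    returned matrix is the normal of the hyperplane behind gate i of layer l.
    """
    merged, acc = [], None
    for w in feature_weights:
        acc = w if acc is None else matmul(w, acc)
        merged.append(acc)
    return merged

def hyperplane_distances(normals, x):
    """Euclidean distance from x to each hyperplane {z : n . z = 0}."""
    return [
        abs(sum(n_i * x_i for n_i, x_i in zip(n, x)))
        / math.sqrt(sum(n_i * n_i for n_i in n))
        for n in normals
    ]

def gate_overlap(normals, x1, x2):
    """Fraction of gates that are in the same on/off state for x1 and x2."""
    s1 = [s > 0 for s in matvec(normals, x1)]
    s2 = [s > 0 for s in matvec(normals, x2)]
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

def dlgn_forward(feature_weights, value_weights, x, beta=10.0):
    """Gated value-network pass: the input x only influences the gates."""
    h = [1.0] * len(value_weights[0][0])  # constant input to the value network
    for normals, v in zip(merged_hyperplanes(feature_weights), value_weights):
        g = [sigmoid(beta * s) for s in matvec(normals, x)]  # soft gates
        h = [g_i * z_i for g_i, z_i in zip(g, matvec(v, h))]
    return h

# Toy example with two gated layers of width 2 (weights are made up).
feature_weights = [[[1, 0], [0, 1]], [[1, 1], [1, -1]]]
value_weights = [[[1, 1], [1, 1]], [[1, 0], [0, 1]]]
x = [3, 4]

merged = merged_hyperplanes(feature_weights)
distances = hyperplane_distances(merged[1], x)
overlap = gate_overlap(merged[1], [3, 4], [4, 3])
output = dlgn_forward(feature_weights, value_weights, x)
```

In this sketch, a larger entry in `distances` means a larger input perturbation is needed before the corresponding gate flips, which is one way to read the observation that PGD-AT hyperplanes lie farther from the data points. `gate_overlap` is only a simple per-layer stand-in for the path-activity and overlap measures the paper uses.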
