DeepDefense: Layer-Wise Gradient-Feature Alignment for Building Robust Neural Networks

Ci Lin, Tet Yeap, Iluju Kiringa, Biwei Zhang

TL;DR

DeepDefense tackles adversarial vulnerability by applying Gradient-Feature Alignment (GFA) regularization layer-wise to align input gradients with internal representations, blocking perturbation propagation. By decomposing perturbations into radial and tangential components, GFA promotes a flatter loss landscape and reduces sensitivity in tangential directions where attacks are effective. Empirically on CIFAR-10, DeepDefense improves robustness against both gradient-based and optimization-based attacks, outperforming standard training and several existing defenses. The method is architecture-agnostic and lightweight, making it practical for real-world deployment and scalable to larger models and datasets.

Abstract

Deep neural networks are known to be vulnerable to adversarial perturbations: small, carefully crafted changes to an input that lead to incorrect predictions. In this paper, we propose DeepDefense, a novel defense framework that applies Gradient-Feature Alignment (GFA) regularization across multiple layers to suppress adversarial vulnerability. By aligning input gradients with internal feature representations, DeepDefense promotes a smoother loss landscape in tangential directions, thereby reducing the model's sensitivity to adversarial noise. We provide theoretical insights into how an adversarial perturbation can be decomposed into radial and tangential components and demonstrate that alignment suppresses loss variation in tangential directions, where most attacks are effective. Empirically, our method achieves significant improvements in robustness across both gradient-based and optimization-based attacks. For example, on CIFAR-10, CNN models trained with DeepDefense outperform standard adversarial training by up to 15.2% under APGD attacks and 24.7% under FGSM attacks. Against optimization-based attacks such as DeepFool and EADEN, DeepDefense requires 20 to 30 times higher perturbation magnitudes to cause misclassification, indicating stronger decision boundaries and a flatter loss landscape. Our approach is architecture-agnostic, simple to implement, and highly effective, offering a promising direction for improving the adversarial robustness of deep learning models.
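
The radial/tangential decomposition described in the abstract can be illustrated with a small NumPy sketch. It rests on two assumptions that this summary does not confirm: that "radial" means the direction of the feature (or input) vector itself, and that alignment is scored as one minus the cosine similarity between gradient and feature. The function names `decompose` and `alignment_penalty` are hypothetical and are not taken from the paper.

```python
import numpy as np

def decompose(delta, x):
    """Split delta into a radial part (along x) and a tangential part (orthogonal to x).

    Assumption: "radial" is taken to be the direction of the vector x.
    """
    x_hat = x / np.linalg.norm(x)
    radial = np.dot(delta, x_hat) * x_hat
    tangential = delta - radial
    return radial, tangential

def alignment_penalty(grad, feat, eps=1e-12):
    """Hypothetical alignment score: 1 - cosine similarity (0 when perfectly aligned)."""
    cos = np.dot(grad, feat) / (np.linalg.norm(grad) * np.linalg.norm(feat) + eps)
    return 1.0 - cos

rng = np.random.default_rng(0)
x = rng.normal(size=8)              # stand-in for an input or layer feature vector
delta = 0.1 * rng.normal(size=8)    # stand-in for an adversarial perturbation
radial, tangential = decompose(delta, x)

aligned_grad = 2.0 * x                   # gradient exactly parallel to x
misaligned_grad = rng.normal(size=8)     # arbitrary gradient direction

# First-order change in loss for a perturbation d is grad . d.
# With an aligned gradient, the tangential component contributes nothing.
change_aligned = float(np.dot(aligned_grad, tangential))
change_misaligned = float(np.dot(misaligned_grad, tangential))

print("tangential loss change, aligned gradient:   ", change_aligned)
print("tangential loss change, misaligned gradient:", change_misaligned)
print("alignment penalty (aligned):   ", alignment_penalty(aligned_grad, x))
print("alignment penalty (misaligned):", alignment_penalty(misaligned_grad, x))
```

Under these assumptions, a gradient parallel to the feature vector makes the first-order loss change vanish for any tangential perturbation, which is one way to read the claim that alignment suppresses loss variation in tangential directions.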

Paper Structure

This paper contains 28 sections, 23 equations, 12 figures, 6 tables, 1 algorithm.

Figures (12)

  • Figure 1: Visualization of adversarial perturbations generated by DeepFool. The top row shows the adversarial example (left) and corresponding perturbation (right) for a model trained with DeepDefense. The bottom row shows the same for a model trained with standard backpropagation. DeepDefense requires larger, more visible perturbations to fool the model, indicating improved robustness.
  • Figure 2: Loss landscapes of models trained with the DEEP strategy (top) and standard backpropagation (bottom), shown in both 2D and 3D. The DEEP model exhibits smoother surfaces with overall flatness, while the standard model is relatively flat only in the radial direction and more sensitive to perturbations.
  • Figure 3: Feature maps from the first convolutional layer of CNNs trained with different strategies. From left to right: original input (first column), standard backpropagation, PGD adversarial training, GFA in the first layer, GFA in the first three layers, GAIE regularization, and feature denoising strategy.
  • Figure 4: Evaluation of model accuracy degradation under increasing adversarial perturbation strength ($\epsilon$) for CNNs trained with different defense strategies against FGSM, PIFGSM, Square, APGD, APGDT, and EOTPGD attacks.
  • Figure 5: Comparison of the robustness of six training strategies (Benchmark, Adv, First, Deep, GAIE, Denoise) under four adversarial attacks: (a) Deep Sparse Fool, (b) Jitter, (c) EADEN, and (d) OnePixel. Each sub-figure illustrates how a single sample is perturbed by one attacker and evaluated across different models. The numbers above the second row of each sub-figure indicate the noise intensity (measured as mean squared error) applied to the perturbed samples.
  • ...and 7 more figures