Meta Invariance Defense Towards Generalizable Robustness to Unknown Adversarial Attacks

Lei Zhang, Yuhang Zhou, Yi Yang, Xinbo Gao

TL;DR

The paper addresses the vulnerability of deep neural networks to adversarial attacks, emphasizing poor robustness to unknown attacks. It introduces Meta Invariance Defense (MID), a two-stage meta-learning framework with a fixed teacher and a trainable student encoder, augmented by multi-consistency distillation to learn attack-invariant features from an Attacker Pool. The authors provide theoretical analyses (Taylor expansion, manifold interpretation, and high-order optimization) and comprehensive experiments across eight datasets, showing that MID achieves superior average robustness against both known and unknown attacks, with interpretable gradient and feature representations. The work offers a practical path toward generalizable robustness and insights into the role of low-frequency, attack-invariant information for defense against adversarial perturbations.

Abstract

Despite providing high-performance solutions for computer vision tasks, deep neural network (DNN) models have been shown to be extremely vulnerable to adversarial attacks. Current defenses mainly focus on known attacks, while adversarial robustness to unknown attacks is seriously overlooked. Moreover, commonly used adaptive learning and fine-tuning techniques are unsuitable for adversarial defense, since defense is essentially a zero-shot problem at deployment. To tackle this challenge, we propose an attack-agnostic defense method named Meta Invariance Defense (MID). Specifically, various combinations of adversarial attacks are randomly sampled from a manually constructed Attacker Pool to constitute different defense tasks against unknown attacks, in which a student encoder is supervised by multi-consistency distillation to learn attack-invariant features via a meta principle. The proposed MID has two merits: 1) Full distillation at the pixel, feature, and prediction levels between benign and adversarial samples facilitates the discovery of attack invariance. 2) The model simultaneously achieves robustness to imperceptible adversarial perturbations in high-level image classification and attack suppression in low-level robust image regeneration. Theoretical and empirical studies on numerous benchmarks such as ImageNet verify the generalizable robustness and superiority of MID under various attacks.

Paper Structure (23 sections, 15 equations, 14 figures, 13 tables)

Figures (14)

  • Figure 1: An intuitive interpretation of attack-invariant features. Humans are good at extracting robust invariant features from cats (such as whiskers and ears). We argue that such Robust Attack Invariant Features (the blue points) exist across various adversarial attacks and benign samples, and that they are associated only with robust semantics rather than malicious perturbations. In this sense, benign and adversarial samples add harmonious benign features (the green +) and malicious adversarial features (the red -) to the invariant feature, respectively.
  • Figure 2: Schematic diagram of MID's training process. In each epoch, MID training is divided into two stages: Meta-Train and Meta-Test. (1) In the Meta-Train phase, we randomly draw $n-1$ attack combinations from the Attacker Pool as the simulated leaked attacks to train robustness to known attacks. (2) In the Meta-Test phase, we choose the $n^{th}$ attack, which is not used during Meta-Train, as the simulated unknown attack to train generalizable robustness to unknown attacks. Words in single quotation marks denote simulated known/unknown attacks within the meta-learning process; the real unknown attacks are never seen during the whole meta-learning process.
  • Figure 3: The framework of Meta Invariance Defense. The teacher model, shaded green, is fixed during training, while the student module, shaded blue, is updated online based on the proposed multi-consistency distillation via the meta-learning paradigm. The output of the teacher decoder is the regenerated robust sample. The final test is conducted with the robust student encoder $E_{student}$ and the classifier $C_{teacher}(w,b)$.
  • Figure 4: Schematic diagram of the manifold interpretation of MID. Gradient-based adversarial samples push benign samples away from the original manifold along the normal vector, since the gradient corresponds to the direction of the normal vector. MID aims to learn the similarity between the original features (benign samples) and neighboring features (i.e., adversarial samples) under multi-consistency distillation via meta learning, and pulls the adversarial samples back to the original manifold.
  • Figure 5: A toy example relating the second derivative (Hessian matrix) to the robustness of a function. We take $\sin(x)$ and $\sin(\pi x)$ as examples. At the minimum point, both functions have the same minimum value and first-derivative value, but $\sin(x)$ is more robust than $\sin(\pi x)$ because it has a smaller second-derivative value. Thus both the curve of $\sin(x)$ and its first derivative are smoother. MID implicitly regularizes the second derivative of the loss function.
  • ...and 9 more figures
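
The Meta-Train/Meta-Test split described in the Figure 2 caption can be illustrated with a small sketch. It reads the caption as drawing $n$ attacks per epoch and holding one out, which is an interpretation rather than the authors' code. The pool contents, the function name `sample_meta_task`, and the choice of `n` are hypothetical placeholders; the source does not list them here.

```python
import random

# Hypothetical attack names: placeholders, not the paper's actual Attacker Pool.
ATTACKER_POOL = ["attack_a", "attack_b", "attack_c", "attack_d"]


def sample_meta_task(pool, n, rng):
    """Draw n distinct attacks; return (n-1 simulated known, 1 simulated unknown)."""
    if not 2 <= n <= len(pool):
        raise ValueError("n must be between 2 and the pool size")
    chosen = rng.sample(pool, n)      # n distinct attacks in random order
    return chosen[:-1], chosen[-1]    # Meta-Train attacks, Meta-Test attack


rng = random.Random(0)                # seeded only to make the sketch repeatable
for epoch in range(3):
    meta_train, meta_test = sample_meta_task(ATTACKER_POOL, n=3, rng=rng)
    print(epoch, meta_train, meta_test)
```

Because the held-out attack is resampled every epoch, each attack in the pool plays the simulated unknown role at some point, while attacks outside the pool stay unseen throughout.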
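
The toy example in the Figure 5 caption can be checked numerically. The sketch below is only an illustration of that caption, not part of the paper: the finite-difference helper, the step size `h`, and the perturbation size `delta` are assumptions chosen for the demonstration. Analytically, the second derivative at a minimum is 1 for $\sin(x)$ and $\pi^2 \approx 9.87$ for $\sin(\pi x)$.

```python
import math


def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


def f(x):
    return math.sin(x)


def g(x):
    return math.sin(math.pi * x)


x_f = -math.pi / 2   # a minimum of sin(x): value -1, first derivative 0
x_g = -0.5           # a minimum of sin(pi * x): value -1, first derivative 0

curv_f = second_derivative(f, x_f)
curv_g = second_derivative(g, x_g)

# How much each function rises when the input is perturbed by the same delta.
delta = 0.1
rise_f = f(x_f + delta) - f(x_f)
rise_g = g(x_g + delta) - g(x_g)

print("curvature:", curv_f, curv_g)
print("rise under perturbation:", rise_f, rise_g)
```

Both functions share the same minimum value and a zero first derivative there, yet the same input perturbation raises $\sin(\pi x)$ by about $\pi^2$ times as much for small `delta`, which is the sense in which the lower-curvature function is more robust.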