
Interpretability-Guided Test-Time Adversarial Defense

Akshay Kulkarni, Tsui-Wei Weng

TL;DR

This paper tackles adversarial vulnerability in deep networks with IG-Defense, a training-free test-time defense that masks neurons according to interpretability-derived importance rankings. By restricting activation shifts to neurons relevant to the ground-truth (GT) class and using a dual forward pass with a sharp pseudo-label, IG-Defense improves robustness on CIFAR10, CIFAR100, and ImageNet-1k models from RobustBench at roughly 2× inference-time cost. The authors introduce two importance-ranking methods, LO-IR and CD-IR, and evaluate them against strong white-box, black-box, and adaptive attacks, where IG-Defense outperforms several existing test-time defenses in worst-case robustness. The approach is efficient and scalable, and it uses neuron-level interpretability to make robustness practical for real-world deployments.

Abstract

We propose a novel and low-cost test-time adversarial defense by devising interpretability-guided neuron importance ranking methods to identify neurons important to the output classes. Our method is a training-free approach that can significantly improve the robustness-accuracy tradeoff while incurring minimal computational overhead. While being among the most efficient test-time defenses (4x faster), our method is also robust to a wide range of black-box, white-box, and adaptive attacks that break previous test-time defenses. We demonstrate the efficacy of our method for CIFAR10, CIFAR100, and ImageNet-1k on the standard RobustBench benchmark (with average gains of 2.6%, 4.9%, and 2.8% respectively). We also show improvements (average 1.5%) over the state-of-the-art test-time defenses even under strong adaptive attacks.

Paper Structure (21 sections, 5 equations, 8 figures, 12 tables)

This paper contains 21 sections, 5 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Interpretability-guided masking based on neuron importance ranking for test-time adversarial defense.
  • Figure 2: A. Cat-dog-bird image classifier before a PGD attack. B. After the PGD attack, the prediction changes from the ground truth (GT) cat to dog, since the activations of neurons important to the dog class increase while those important to the cat class decrease. C. Empirically, successful PGD attacks show a decrease in the activations of important GT-class neurons, while the activations of neurons important to the post-attack predicted class increase. D. For unsuccessful PGD attacks, the activations of all neurons decrease even though the prediction remains the same as before the attack.
  • Figure 3: Step 1. Given a pretrained base model $f$ (e.g. binary classifier here), class-wise neuron importance is computed for a selected layer (Sec. \ref{sec:loir}). A top-$k$ mask $m\!\in\!\{0, 1\}^{N\times C}$ is computed to identify top-$k$ neurons important to each class (e.g. $k=2, N=5, C=2$ here). Step 2. During evaluation, a soft-pseudo-label $\hat{y}$ is computed using the base model $f$. Step 3. The soft-pseudo-label weighted mask $m\hat{y} = m\; \sigma(\frac{f(x)}{\tau})\in \mathbb{R}^N$ is applied to the selected layer to retain only the important neurons of the pseudo-label class.
  • Figure 4: A. Leave-one-Out Importance Ranking (LO-IR) computes class-wise neuron importance as the average change in logits when masking each neuron. B. CLIP-Dissect Importance Ranking (CD-IR) first computes the inner product of class name text embeddings and probing set image embeddings. The importance ranking then relies on the similarity between each neuron's activations and the precomputed inner product, given the same probing set inputs. C. A higher logit-change $\Delta_{c_i}^{[j]}$ or similarity $s_{c_i}^{[j]}$ implies that neuron $j$ is more important to class $c_i$.
  • Figure 5: Sensitivity analyses for number of retained important neurons $k$ and layer being masked with CIFAR10. For A, B: layer4 with total 512 neurons is used. A lower $k$ and a deeper layer yield better adversarial robustness with minimal loss of clean accuracy.
  • ...and 3 more figures
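
The three steps in the Figure 3 caption, with LO-IR (Figure 4A) as the ranking method, can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-layer network, its weights, the probing samples, and all function names are hypothetical; $\sigma$ is read as a softmax; and the LO-IR average is taken over probing samples of each class, which the caption does not specify. The sizes $N=5$, $C=2$, $k=2$ follow the caption's example.

```python
import math

def relu_layer(W1, x):
    # Activations of the selected layer (N neurons).
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]

def head(W2, h):
    # Logits computed from the selected layer's activations (C classes).
    return [sum(w * hj for w, hj in zip(row, h)) for row in W2]

def softmax(z, tau=1.0):
    scaled = [v / tau for v in z]
    top = max(scaled)
    exps = [math.exp(v - top) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def loir_importance(W1, W2, xs, ys, num_classes):
    # delta[c][j]: mean drop in logit c when neuron j is zeroed.
    # Assumption: averaged over probing samples labeled c.
    n = len(W1)
    delta = [[0.0] * n for _ in range(num_classes)]
    counts = [0] * num_classes
    for x, y in zip(xs, ys):
        h = relu_layer(W1, x)
        base_logit = head(W2, h)[y]
        counts[y] += 1
        for j in range(n):
            h_masked = h[:j] + [0.0] + h[j + 1:]
            delta[y][j] += base_logit - head(W2, h_masked)[y]
    return [[d / max(counts[c], 1) for d in delta[c]]
            for c in range(num_classes)]

def topk_mask(delta, k):
    # Step 1: m[j][c] = 1 if neuron j is among the top-k for class c.
    num_classes, n = len(delta), len(delta[0])
    m = [[0] * num_classes for _ in range(n)]
    for c in range(num_classes):
        ranked = sorted(range(n), key=lambda j: delta[c][j], reverse=True)
        for j in ranked[:k]:
            m[j][c] = 1
    return m

def masked_forward(W1, W2, m, x, tau=0.1):
    h = relu_layer(W1, x)
    # Step 2: soft pseudo-label from the unmasked base model.
    y_hat = softmax(head(W2, h), tau)
    # Step 3: pseudo-label weighted mask (m y_hat, one weight per neuron).
    weights = [sum(mjc * p for mjc, p in zip(row, y_hat)) for row in m]
    # The layers before the mask are unchanged, so this toy reuses h
    # instead of recomputing it in a second pass.
    return head(W2, [w * hj for w, hj in zip(weights, h)])

# Hypothetical toy model: 5 neurons, 2 classes.
W1 = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
W2 = [[2.0, 1.0, 0.1, 0.0, 0.0],
      [0.0, 0.0, 0.1, 1.0, 2.0]]
# Hypothetical probing set: two samples per class.
xs = [[1, 1, 1, 0, 0], [1, 0.5, 1, 0.2, 0],
      [0, 0, 1, 1, 1], [0, 0.2, 1, 0.5, 1]]
ys = [0, 0, 1, 1]

delta = loir_importance(W1, W2, xs, ys, num_classes=2)
m = topk_mask(delta, k=2)

# A class-0 input whose class-1 neurons have been pushed up.
x_shifted = [1, 1, 1, 0.9, 0.9]
base = head(W2, relu_layer(W1, x_shifted))
defended = masked_forward(W1, W2, m, x_shifted)
print("mask:", m)
print("base margin:", base[0] - base[1])
print("masked margin:", defended[0] - defended[1])
```

In this toy, the mask suppresses the neurons ranked important for the other class, so the logit margin of the pseudo-label class widens. The mask follows the base model's pseudo-label, so the sketch says nothing about inputs on which the base prediction has already flipped.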