Explanation-Guided Adversarial Training for Robust and Interpretable Models

Chao Chen, Yanhui Chen, Shanshan Lin, Dongsheng Hong, Shu Wu, Xiangwen Liao, Chuanyi Liu

TL;DR

This work proposes Explanation-Guided Adversarial Training (EGAT), a unified framework that combines the strengths of adversarial training (AT) and explanation-guided learning (EGL) to simultaneously improve prediction performance, robustness, and explanation quality, and formalizes it within the Probably Approximately Correct (PAC) learning framework.

Abstract

Deep neural networks (DNNs) have achieved remarkable performance in many tasks, yet they often behave as opaque black boxes. Explanation-guided learning (EGL) methods steer DNNs using human-provided explanations or supervision on model attributions. These approaches improve interpretability but typically assume benign inputs and incur heavy annotation costs. In practice, however, both the predictions and the saliency maps of DNNs can change dramatically under imperceptible perturbations or unseen patterns. Adversarial training (AT) can substantially improve robustness, but it does not guarantee that model decisions rely on semantically meaningful features. In response, we propose Explanation-Guided Adversarial Training (EGAT), a unified framework that integrates the strengths of AT and EGL to simultaneously improve prediction performance, robustness, and explanation quality. EGAT generates adversarial examples on the fly while imposing explanation-based constraints on the model. By jointly optimizing classification performance, adversarial robustness, and attributional stability, EGAT is not only more resistant to unexpected cases, including adversarial attacks and out-of-distribution (OOD) scenarios, but also offers human-interpretable justifications for its decisions. We further formalize EGAT within the Probably Approximately Correct (PAC) learning framework, demonstrating theoretically that it yields more stable predictions under unexpected situations than standard AT. Empirical evaluations on OOD benchmark datasets show that EGAT consistently outperforms competitive baselines in both clean accuracy and adversarial accuracy (+37%) while producing more semantically meaningful explanations and requiring only a limited increase (+16%) in training time.

Paper Structure (47 sections, 2 theorems, 23 equations, 4 figures, 4 tables)

Key Result

Theorem 1

Let $f \in \mathcal{H}$ be a hypothesis satisfying the Lipschitz-continuity and bounded-gradient-magnitude assumptions. Then, with probability at least $1 - \delta$ over the draw of $n$ i.i.d. samples from $\mathcal{D}$, a generalization bound on the adversarial risk holds (the displayed inequality is not reproduced here), where $d$ is the input dimensionality and $n$ is the sample size.

Figures (4)

  • Figure 1: Overview of the proposed EGAT framework. For each original input, EGAT optimizes a composite objective of four complementary terms: (1) the traditional classification loss ($\mathcal{L}_{cls}$, in gray), computed as cross-entropy; (2) the adversarial-training-based loss ($\mathcal{L}_{adv}$, in red), which promotes robustness of both predictions and explanations; (3) the explanation-guided learning loss ($\mathcal{L}_{egl}$, in purple), which uses explanations as extra supervision signals; (4) a regularizer ($\mathcal{L}_{reg}$, in yellow), which incorporates a mixup strategy to further regulate model behavior.
  • Figure 2: Adversarial robustness across different attack methods (FGSM, MI-FGSM, PGD, SCA) and datasets (VLCS and Terra Incognita). The plots show accuracy under perturbation budgets ranging from 0 (clean) to 0.04. EGAT (red solid curves) shows significantly more stable predictions than the baselines under strong adversarial attacks.
  • Figure 3: Adversarial accuracy vs. training time per epoch. Red points denote Pareto-optimal methods and the red curve denotes the Pareto front.
  • Figure 4: Visualization of explanation heatmaps under clean and adversarial conditions. We present Grad-CAM heatmaps of four models (columns: ERM, DRE, SGDrop, and EGAT) for three images before and after PGD perturbations. These visualizations illustrate where each model attends when making predictions and how their explanations change under adversarial attacks.
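
The four-term objective described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the exact loss definitions are not given in this summary, so the linear softmax model, the single-step FGSM attack, the use of the input gradient of the loss as a saliency map, the binary relevance `mask`, the mixup coefficient, and the unit term weights are all assumptions chosen to keep the example small and analytically differentiable.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W, x, y_soft):
    """Mean cross-entropy of a linear softmax model against (soft) labels."""
    p = softmax(x @ W)
    return float(-(y_soft * np.log(p + 1e-12)).sum(axis=1).mean())

def input_gradient(W, x, y_soft):
    """Per-sample gradient of the cross-entropy w.r.t. the input.
    Used here as a simple saliency map (an assumption of this sketch)."""
    return (softmax(x @ W) - y_soft) @ W.T

def fgsm(W, x, y_soft, eps):
    """Single-step FGSM: move each input by eps along the sign of its gradient."""
    return x + eps * np.sign(input_gradient(W, x, y_soft))

def egat_style_loss(W, x, y_soft, mask, eps=0.03,
                    weights=(1.0, 1.0, 1.0), mix=0.7, seed=0):
    """Hypothetical four-term loss: cls + adv + egl + reg."""
    rng = np.random.default_rng(seed)

    # (1) Classification loss on clean inputs.
    l_cls = cross_entropy(W, x, y_soft)

    # (2) Adversarial term: loss on perturbed inputs plus a penalty on
    #     how much the saliency map moves under the perturbation.
    x_adv = fgsm(W, x, y_soft, eps)
    s_clean = input_gradient(W, x, y_soft)
    s_adv = input_gradient(W, x_adv, y_soft)
    l_adv = cross_entropy(W, x_adv, y_soft) \
        + float(((s_adv - s_clean) ** 2).sum(axis=1).mean())

    # (3) Explanation-guided term: penalise saliency mass that falls
    #     outside the features marked relevant by the (hypothetical) mask.
    l_egl = float(((np.abs(s_clean) * (1.0 - mask)) ** 2).sum(axis=1).mean())

    # (4) Mixup regulariser: loss on convex combinations of sample pairs.
    perm = rng.permutation(len(x))
    x_mix = mix * x + (1.0 - mix) * x[perm]
    y_mix = mix * y_soft + (1.0 - mix) * y_soft[perm]
    l_reg = cross_entropy(W, x_mix, y_mix)

    w_adv, w_egl, w_reg = weights
    parts = {"cls": l_cls, "adv": w_adv * l_adv,
             "egl": w_egl * l_egl, "reg": w_reg * l_reg}
    return sum(parts.values()), parts

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=(8, 5))                 # 8 samples, 5 features
    W = rng.normal(size=(5, 3))                 # 3 classes
    y = np.eye(3)[rng.integers(0, 3, size=8)]   # one-hot labels
    mask = np.zeros((8, 5))
    mask[:, :2] = 1.0                           # pretend features 0-1 are relevant
    total, parts = egat_style_loss(W, x, y, mask)
    print(total, parts)
```

In a real training loop the total would be minimized with respect to the model parameters, typically with an autodiff framework and a multi-step attack such as PGD; the sketch only evaluates the terms to show how they fit together.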

Theorems & Definitions (7)

  • Definition 1: EGAT Adversarial Loss
  • Definition 2: Population Adversarial Risk under EGAT
  • Definition 3: Empirical Adversarial Risk under EGAT
  • Theorem 1: Generalization Bound of Adversarial Risk for EGAT
  • Proof: Proof Sketch
  • Lemma 1: Out-of-Distribution Stability via Explanation Consistency
  • Proof: Proof Sketch