Explanation-Guided Adversarial Training for Robust and Interpretable Models

Chao Chen; Yanhui Chen; Shanshan Lin; Dongsheng Hong; Shu Wu; Xiangwen Liao; Chuanyi Liu

TL;DR

This work proposes Explanation-Guided Adversarial Training (EGAT), a unified framework that combines the strengths of adversarial training (AT) and explanation-guided learning (EGL) to simultaneously improve prediction performance, robustness, and explanation quality, and formalizes it within the Probably Approximately Correct (PAC) learning framework.

Abstract

Deep neural networks (DNNs) have achieved remarkable performance in many tasks, yet they often behave as opaque black boxes. Explanation-guided learning (EGL) methods steer DNNs using human-provided explanations or supervision on model attributions. These approaches improve interpretability but typically assume benign inputs and incur heavy annotation costs. In practice, however, both the predictions and the saliency maps of DNNs can change dramatically under imperceptible perturbations or unseen patterns. Adversarial training (AT) can substantially improve robustness, but it does not guarantee that model decisions rely on semantically meaningful features. In response, we propose Explanation-Guided Adversarial Training (EGAT), a unified framework that integrates the strengths of AT and EGL to simultaneously improve prediction performance, robustness, and explanation quality. EGAT generates adversarial examples on the fly while imposing explanation-based constraints on the model. By jointly optimizing classification performance, adversarial robustness, and attributional stability, EGAT is not only more resistant to unexpected cases, including adversarial attacks and out-of-distribution (OOD) scenarios, but also offers human-interpretable justifications for its decisions. We further formalize EGAT within the Probably Approximately Correct (PAC) learning framework, demonstrating theoretically that it yields more stable predictions under unexpected situations than standard AT. Empirical evaluations on OOD benchmark datasets show that EGAT consistently outperforms competitive baselines in both clean accuracy and adversarial accuracy (+37%), while producing more semantically meaningful explanations and requiring only a limited increase (+16%) in training time.
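
The idea of generating adversarial examples on the fly while constraining explanations can be illustrated with a minimal sketch on a toy logistic-regression model. This is not the authors' implementation. It assumes a one-step FGSM-style attack, gradient-times-input attributions in place of Grad-CAM, a squared-distance stability penalty, and hypothetical weights `lam_adv` and `lam_exp`; the abstract specifies none of these choices, and the function name `egat_style_loss` is invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # Probability of class 1 under a logistic-regression model.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def bce(p, y):
    # Binary cross-entropy with a small epsilon for numerical safety.
    eps = 1e-12
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    # For logistic regression, d(BCE)/dx = (p - y) * w.
    # One FGSM step moves each coordinate by eps along the gradient sign.
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

def attribution(w, b, x):
    # Gradient-times-input attribution of the predicted probability:
    # dp/dx_i * x_i = p * (1 - p) * w_i * x_i (assumed stand-in for Grad-CAM).
    p = predict(w, b, x)
    return [p * (1 - p) * wi * xi for wi, xi in zip(w, x)]

def egat_style_loss(w, b, x, y, eps=0.1, lam_adv=1.0, lam_exp=1.0):
    # Hypothetical combination: clean loss + adversarial loss
    # + penalty on attribution drift between clean and adversarial inputs.
    x_adv = fgsm(w, b, x, y, eps)
    l_cls = bce(predict(w, b, x), y)
    l_adv = bce(predict(w, b, x_adv), y)
    a_clean, a_adv = attribution(w, b, x), attribution(w, b, x_adv)
    l_exp = sum((u - v) ** 2 for u, v in zip(a_clean, a_adv))
    total = l_cls + lam_adv * l_adv + lam_exp * l_exp
    return total, (l_cls, l_adv, l_exp)

if __name__ == "__main__":
    total, parts = egat_style_loss([2.0, -1.0], 0.0, [1.0, 0.5], 1)
    print(total, parts)
```

In this sketch the adversarial example is recomputed from the current parameters on every call, which mirrors the "on the fly" generation described above; setting `eps=0.0` makes the adversarial and stability terms collapse to the clean case.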

Paper Structure (47 sections, 2 theorems, 23 equations, 4 figures, 4 tables)

Introduction
Related Work
Explanation-guided learning
Robust Machine Learning
Preliminaries
Explainable Machine Learning and Grad-CAM
Explanation guided learning
Adversarial Training
Methodology
Explanation-Guided Adversarial Training Framework
Overall Objective
Adversarial Training
Explanation guided learning
Regularizers
PAC-Theoretic Analysis for EGAT
...and 32 more sections

Key Result

Theorem 1

Let $f \in \mathcal{H}$ be a hypothesis satisfying the Lipschitz-continuity and bounded-gradient-magnitude assumptions. Then, with probability at least $1 - \delta$ over the draw of $n$ i.i.d. samples from $\mathcal{D}$, the adversarial risk of $f$ under EGAT satisfies a generalization bound in which $d$ is the input dimensionality and $n$ is the sample size.

Figures (4)

Figure 1: Overview of the proposed EGAT framework. For each original input, EGAT optimizes a composite objective of four complementary terms: (1) the traditional classification loss ($\mathcal{L}_{cls}$, in gray), computed as cross-entropy; (2) the adversarial-training loss ($\mathcal{L}_{adv}$, in red), which promotes the robustness of both predictions and explanations; (3) the explanation-guided learning loss ($\mathcal{L}_{egl}$, in purple), which adopts explanations as extra supervision signals; (4) the regularizer ($\mathcal{L}_{reg}$, in yellow), which incorporates a mixup strategy to further regulate model behavior.
Figure 2: Adversarial robustness across different attack methods (FGSM, MI-FGSM, PGD, SCA) and datasets (VLCS and Terra Incognita). The figures show accuracy under varying perturbation budgets, ranging from 0 (clean) to 0.04. EGAT (in red solid curves) shows significantly more stable predictions than baselines under strong adversarial attacks.
Figure 3: Adversarial accuracy vs. training time per epoch. Red points denote Pareto-optimal methods and the red curve denotes the Pareto front.
Figure 4: Visualization of explanation heatmaps under clean and adversarial conditions. We present Grad-CAM heatmaps of four models (columns: ERM, DRE, SGDrop, and EGAT) for three images before and after PGD perturbations. These visualizations illustrate where each model attends when making predictions and how their explanations change under adversarial attacks.
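
The Pareto front mentioned in the Figure 3 caption can be sketched as follows. The caption does not say how the front is computed, so this example assumes the standard dominance rule: a method is Pareto-optimal if no other method is at least as fast and at least as accurate while being strictly better on one of the two. The method names and numbers are placeholders, not results from the paper.

```python
def pareto_front(points):
    """Return the non-dominated (name, time, accuracy) triples, sorted by time.

    Lower training time is better; higher adversarial accuracy is better.
    """
    front = []
    for name, t, acc in points:
        dominated = any(
            (t2 <= t and a2 >= acc) and (t2 < t or a2 > acc)
            for _, t2, a2 in points
        )
        if not dominated:
            front.append((name, t, acc))
    return sorted(front, key=lambda p: p[1])

# Placeholder data: (method, seconds per epoch, adversarial accuracy).
methods = [
    ("A", 10.0, 0.30),
    ("B", 12.0, 0.55),
    ("C", 15.0, 0.50),  # slower and less accurate than B, so dominated
    ("D", 20.0, 0.60),
]

if __name__ == "__main__":
    for name, t, acc in pareto_front(methods):
        print(name, t, acc)
```

Connecting the returned points in order of increasing time gives a curve of the kind the caption describes; a method lying below that curve is beaten on both axes by some alternative.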

Theorems & Definitions (7)

Definition 1: EGAT Adversarial Loss
Definition 2: Population Adversarial Risk under EGAT
Definition 3: Empirical Adversarial Risk under EGAT
Theorem 1: Generalization Bound of Adversarial Risk for EGAT
Proof: Proof Sketch
Lemma 1: Out-of-Distribution Stability via Explanation Consistency
Proof: Proof Sketch
