Explainable Adversarial Attacks on Coarse-to-Fine Classifiers

Akram Heidarizadeh; Connor Hatfield; Lorenzo Lazzarotto; HanQin Cai; George Atia

TL;DR

The paper tackles the problem of explainable adversarial attacks on coarse-to-fine (C2F) classifiers by introducing LRP-guided perturbations that influence heatmaps used for decision reasoning at both coarse and fine stages. It presents two attack variants, LRPC (coarse-level) and LRPF (fine-level), each optimizing perturbations via gradient descent to manipulate Layer-wise Relevance Propagation heatmaps while obeying an $\ell_\infty$ budget $\|\eta\|_\infty \le \epsilon$. The authors formalize the C2F model, define LRP-based loss terms for both levels, and demonstrate on ImageNet with a VGG-16 backbone that the attacks can achieve high fooling rates with perceptibility comparable to strong baselines, all while yielding more interpretable failure modes through heatmap shifts. This work contributes a framework for understanding and auditing hierarchical model behavior under adversarial perturbations, with potential implications for model explainability and robustness analysis in multi-stage vision systems.
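The budgeted optimization described above can be illustrated with a small sketch. This is not the authors' implementation: the quadratic surrogate loss, the names `project_linf`, `descend_with_budget`, and `toy_grad`, and the step size and iteration count are all assumptions chosen for illustration. It shows only the general pattern of taking gradient-descent steps on a perturbation $\eta$ and projecting back onto the ball $\|\eta\|_\infty \le \epsilon$ after each step.

```python
from typing import Callable, List

Vector = List[float]


def project_linf(eta: Vector, epsilon: float) -> Vector:
    """Clip every coordinate of eta into [-epsilon, epsilon]."""
    return [max(-epsilon, min(epsilon, e)) for e in eta]


def descend_with_budget(
    grad_fn: Callable[[Vector], Vector],
    x: Vector,
    epsilon: float,
    step: float,
    iters: int,
) -> Vector:
    """Minimize a loss over eta by gradient descent, keeping ||eta||_inf <= epsilon.

    grad_fn returns the gradient of the (hypothetical) loss evaluated at x + eta.
    """
    eta = [0.0] * len(x)
    for _ in range(iters):
        x_adv = [xi + ei for xi, ei in zip(x, eta)]
        grad = grad_fn(x_adv)
        # Plain gradient-descent step on the perturbation.
        eta = [ei - step * gi for ei, gi in zip(eta, grad)]
        # Projection step enforces the l_inf budget.
        eta = project_linf(eta, epsilon)
    return eta


# Toy surrogate loss (an assumption, standing in for an LRP-based loss):
# L(z) = sum_i (z_i - t_i)^2, whose gradient is 2 * (z - t).
target = [0.0, 5.0, -3.0]


def toy_grad(z: Vector) -> Vector:
    return [2.0 * (zi - ti) for zi, ti in zip(z, target)]


x = [1.0, 1.0, 1.0]
eta = descend_with_budget(toy_grad, x, epsilon=2.0, step=0.1, iters=200)
print(eta)
```

In this toy run the first coordinate can reach its target within the budget, while the other two are held at the boundary $\pm\epsilon$ by the projection. In the paper's setting the gradient would instead come from the LRP-based coarse-level or fine-level loss.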

Abstract

Traditional adversarial attacks typically aim to alter the predicted labels of input images by generating perturbations that are imperceptible to the human eye. However, these approaches often lack explainability. Moreover, most existing work on adversarial attacks focuses on single-stage classifiers, while multi-stage classifiers remain largely unexplored. In this paper, we introduce instance-based adversarial attacks for multi-stage classifiers, leveraging Layer-wise Relevance Propagation (LRP), which assigns relevance scores to pixels based on their influence on classification outcomes. Our approach generates explainable adversarial perturbations by using LRP to identify and target the key features that are critical for both coarse and fine-grained classifications. Unlike conventional attacks, our method not only induces misclassification but also enhances the interpretability of the model's behavior across classification stages, as demonstrated by experimental results.
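To make the notion of relevance scores concrete, the following is a minimal sketch of one widely used LRP variant, the epsilon rule, applied to a single dense layer. This section does not state which LRP rule the authors use, so the choice of rule, the function name `lrp_epsilon_dense`, and the toy weights are assumptions for illustration only.

```python
from typing import List


def lrp_epsilon_dense(
    a: List[float],
    W: List[List[float]],
    b: List[float],
    R_out: List[float],
    eps: float = 1e-6,
) -> List[float]:
    """Redistribute output relevance R_out onto the inputs a of a dense layer.

    W[i][j] is the weight from input i to output j; b[j] is the bias of output j.
    Epsilon rule: R_i = a_i * sum_j W[i][j] * R_j / (z_j + eps * sign(z_j)).
    """
    n, m = len(a), len(b)
    # Pre-activations of the layer.
    z = [sum(a[i] * W[i][j] for i in range(n)) + b[j] for j in range(m)]
    # Stabilized ratio of relevance to pre-activation for each output.
    s = [R_out[j] / (z[j] + (eps if z[j] >= 0 else -eps)) for j in range(m)]
    # Each input receives relevance in proportion to its contribution a_i * W[i][j].
    return [a[i] * sum(W[i][j] * s[j] for j in range(m)) for i in range(n)]


# Toy layer: two inputs, two outputs, zero bias (illustrative values).
a = [1.0, 2.0]
W = [[1.0, -1.0], [0.5, 1.0]]
b = [0.0, 0.0]
R_out = [1.0, 0.0]  # explain output 0 only

R_in = lrp_epsilon_dense(a, W, b, R_out)
print(R_in)
```

With zero bias and a small `eps`, the input relevances sum to approximately the output relevance, which is the conservation property that makes the resulting heatmaps interpretable as a decomposition of the prediction score.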

Paper Structure (9 sections, 10 equations, 2 figures, 2 tables, 1 algorithm)

Introduction
Background
Coarse-to-Fine Model Formulation
Layer-wise Relevance Propagation
LRP Attack Formulation
Fooling the Coarse Level
Fooling the Fine Level
Experimental Results
Conclusion

Figures (2)

Figure 1: LRP visualizations before and after LRPC and DFC attacks. (a1) LRP of the original coarse class ($r_{\text{org}}$: "vehicle") before the attack. (a2) LRP of the adversarial coarse class ($r_{\text{adv}}$: "clothes") before the attack. (a3) Benign image. (c1, d1, e1) LRP of $r_{\text{org}}$ after LRPC attack for $\epsilon = 10, 20, 40$, compared to (b1) for DFC. (c2, d2, e2) LRP of $r_{\text{adv}}$ after LRPC attack for $\epsilon = 10, 20, 40$, compared to (b2) for DFC. Perturbations generated with LRPC ($\epsilon = 10, 20, 40$) are shown in (c3, d3, e3), and for DFC in (b3). LRP norms and prediction scores are displayed below the respective cases.
Figure 2: Example LRP visualizations of (a1) the original fine class ($f_{\text{org}}$) and (a2) the adversarial fine class ($f_{\text{adv}}$) before the attack. (a3) Benign image. (b1, c1) LRP of $f_{\text{org}}$ after the LRPF attack with $\epsilon = 10, 40$. (b2, c2) LRP of $f_{\text{adv}}$ after the LRPF attack with $\epsilon = 10, 40$. LRP norms are displayed below the respective heatmaps. (b3, c3) Perturbations generated with LRPF for $\epsilon = 10, 40$, shown with the perturbed images; prediction scores are displayed below the respective images. Here $r_{\text{org}}$ is "bird", $f_{\text{org}}$ is "bunting", and $f_{\text{adv}}$ is "jay".
