CausAdv: A Causal-based Framework for Detecting Adversarial Examples
Hichem Debbi
TL;DR
CausAdv introduces a causal, counterfactual framework for detecting adversarial examples in CNNs by analyzing the Counterfactual Information (CI) of filters in the last convolutional layer. Prototypes are used to define causal versus non-causal filters and to compute CI via ablation, enabling four detection strategies that distinguish clean from adversarial inputs without modifying the input or training additional detectors. Empirical results on ImageNet and CIFAR-10 show strong detection capabilities, particularly against BIM and FGSM-like attacks, and demonstrate superior performance over several existing defenses while providing interpretable causal explanations. The approach highlights the value of causal reasoning for robustness and offers a lightweight, architecture-agnostic, and explainable detection mechanism with practical applicability. CI-based visualizations further support the interpretability of decisions and may aid in qualitative adversarial analysis and localization tasks.
Abstract
Deep learning has led to tremendous success in computer vision, largely due to Convolutional Neural Networks (CNNs). However, CNNs have been shown to be vulnerable to crafted adversarial perturbations. This vulnerability to adversarial examples has motivated research into improving model robustness through adversarial detection and defense methods. In this paper, we address the adversarial robustness of CNNs through causal reasoning. We propose CausAdv: a causal framework for detecting adversarial examples based on counterfactual reasoning. CausAdv learns both causal and non-causal features of every input, and quantifies the counterfactual information (CI) of every filter of the last convolutional layer. We then perform a statistical analysis of the filters' CI across clean and adversarial samples to demonstrate that adversarial examples exhibit different CI distributions compared to clean samples. Our results show that causal reasoning enhances the process of adversarial detection without the need to train a separate detector. Moreover, we illustrate the efficiency of causal explanations as a helpful detection tool by visualizing the extracted causal features.
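
The idea of quantifying per-filter CI by ablation can be illustrated with a minimal sketch. This section does not give the exact CI formula, so the definition below is an assumption: CI of a filter is taken as the drop in the predicted-class probability when that filter's feature map is zeroed. The toy classifier head (global average pooling followed by a linear layer) and all names (`counterfactual_information`, `head_weights`, `head_bias`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D vector of logits.
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def counterfactual_information(feature_maps, head_weights, head_bias, target):
    """Assumed CI: probability drop for `target` when one filter is ablated.

    feature_maps: (C, H, W) activations of the last convolutional layer.
    head_weights: (num_classes, C) weights of a toy linear head (assumption).
    head_bias:    (num_classes,) bias of the toy head (assumption).
    """
    # Global average pooling turns each filter's map into one scalar.
    pooled = feature_maps.mean(axis=(1, 2))
    base_prob = softmax(head_weights @ pooled + head_bias)[target]

    ci = np.empty(pooled.shape[0])
    for i in range(pooled.shape[0]):
        ablated = pooled.copy()
        ablated[i] = 0.0  # zeroing the map zeroes its pooled value
        ci[i] = base_prob - softmax(head_weights @ ablated + head_bias)[target]
    return ci

# Toy example: three filters, two classes, target class 0.
feature_maps = np.stack([
    np.ones((2, 2)),   # filter 0: active, supports class 0
    np.ones((2, 2)),   # filter 1: active, works against class 0
    np.zeros((2, 2)),  # filter 2: inactive
])
head_weights = np.array([[2.0, -1.0, 5.0],
                         [0.0,  0.0, 0.0]])
head_bias = np.zeros(2)

ci = counterfactual_information(feature_maps, head_weights, head_bias, target=0)
causal_filters = np.flatnonzero(ci > 0)  # filters whose removal lowers confidence
```

Under this assumed definition, filters with positive CI play the role of causal filters and the remaining ones are non-causal. Comparing the resulting CI vector of an input against the CI statistics of clean samples is the kind of analysis the abstract describes, though the detection strategies themselves are not reproduced here.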
