Enhancing Adversarial Example Detection Through Model Explanation

Qian Ma; Ziping Ye

TL;DR

The paper examines adversarial vulnerability through an evaluation of AmI, a detector that uses model explanations to distinguish adversarial from clean inputs. It shows that AmI's effectiveness is highly sensitive to hyperparameter choices and external factors, which raises concerns about its robustness and reproducibility. Through empirical analysis, including experiments with C&W wb8 adversarial examples, the authors demonstrate that the high reported detection rates can be achieved primarily by tuning a hyperparameter rather than through inherent resilience, and that differences across execution environments further affect performance. The work advocates a rigorous, environment-aware evaluation framework and outlines directions for developing more resilient explanation-driven defenses, assessed with metrics that account for both false positives and false negatives in real-world settings.
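The point that a high detection rate can come from tuning a hyperparameter, and that metrics must also count false positives, can be illustrated with a minimal sketch. It assumes a generic detector that assigns each input a hypothetical inconsistency score and flags the input as adversarial when the score exceeds a threshold. This is not AmI's actual consistency check, and the scores below are invented for illustration, not results from the paper. Sweeping the threshold shows how the detection rate can be pushed to 100% while the false positive rate on clean inputs rises with it.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DetectionMetrics:
    threshold: float
    detection_rate: float       # fraction of adversarial inputs flagged (true positive rate)
    false_positive_rate: float  # fraction of clean inputs wrongly flagged


def evaluate_threshold(adv_scores: Sequence[float],
                       clean_scores: Sequence[float],
                       threshold: float) -> DetectionMetrics:
    """Flag an input as adversarial when its (hypothetical) score exceeds the threshold."""
    if not adv_scores or not clean_scores:
        raise ValueError("need at least one adversarial and one clean score")
    detected = sum(s > threshold for s in adv_scores)
    false_alarms = sum(s > threshold for s in clean_scores)
    return DetectionMetrics(threshold,
                            detected / len(adv_scores),
                            false_alarms / len(clean_scores))


def sweep(adv_scores: Sequence[float],
          clean_scores: Sequence[float],
          thresholds: Sequence[float]) -> List[DetectionMetrics]:
    """Evaluate the same detector at several threshold settings."""
    return [evaluate_threshold(adv_scores, clean_scores, t) for t in thresholds]


if __name__ == "__main__":
    # Toy scores for illustration only; not data from the paper.
    adv = [0.9, 0.7, 0.4, 0.3]
    clean = [0.5, 0.2, 0.1, 0.05]
    for m in sweep(adv, clean, [0.0, 0.25, 0.6]):
        print(f"threshold={m.threshold:.2f} "
              f"detection_rate={m.detection_rate:.2f} "
              f"false_positive_rate={m.false_positive_rate:.2f}")
```

With these toy scores, lowering the threshold from 0.6 to 0.25 raises the detection rate from 50% to 100% but also flags a quarter of the clean inputs, and a threshold of 0.0 flags every input. Reporting both rates, rather than the detection rate alone, makes this trade-off visible.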

Abstract

Adversarial examples are a major problem for machine learning models, driving a continuous search for effective defenses. One promising direction is to leverage model explanations to better understand and defend against these attacks. We examine AmI, a method proposed in a NeurIPS 2018 spotlight paper that uses model explanations to detect adversarial examples. Our study shows that, while AmI is a promising idea, its performance depends too heavily on specific settings (e.g., hyperparameters) and on external factors such as the operating system and the deep learning framework used; these drawbacks limit AmI's practical usage. Our findings highlight the need for more robust defense mechanisms that remain effective across varied conditions. In addition, we advocate a comprehensive evaluation framework for defense techniques.
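Because AmI's behavior is reported to vary with the operating system and the deep learning framework, an environment-aware evaluation would, at minimum, store the execution environment next to every reported metric. The following is a minimal Python sketch of that idea using only the standard library (`platform`, `importlib.metadata`, `json`). The helper names `environment_record` and `tag_result`, and the default package names `torch` and `tensorflow`, are hypothetical choices, not part of the paper or of AmI's code.

```python
import json
import platform
from importlib import metadata
from typing import Iterable


def framework_version(package: str) -> str:
    """Return the installed version of a distribution, or a marker if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_record(frameworks: Iterable[str] = ("torch", "tensorflow")) -> dict:
    """Collect the operating system, Python, and framework versions for one run."""
    return {
        "os": platform.system(),            # e.g. "Linux", "Windows", "Darwin"
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        # Hypothetical default package names; adjust to the frameworks actually used.
        "frameworks": {name: framework_version(name) for name in frameworks},
    }


def tag_result(result: dict, frameworks: Iterable[str] = ("torch", "tensorflow")) -> str:
    """Serialize an evaluation result together with its environment as JSON."""
    payload = {"result": result, "environment": environment_record(frameworks)}
    return json.dumps(payload, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Placeholder metrics; substitute the numbers an actual evaluation produces.
    print(tag_result({"detection_rate": None, "false_positive_rate": None}))
```

Comparing such records across runs makes it possible to tell whether a change in detection rate coincides with a change of operating system or framework version.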
