FeatureLens: A Highly Generalizable and Interpretable Framework for Detecting Adversarial Examples Based on Image Features

Zhigang Yang, Yuan Liu, Jiawei Zhang, Puning Zhang, Xinqiang Ma

TL;DR

FeatureLens provides a model-agnostic adversarial detection framework that relies on a compact 51-dimensional image feature space spanning frequency, gradient, edge/texture, and distributional-shift statistics. By training shallow classifiers on these interpretable features, it achieves high closed-set accuracy and robust cross-attack generalization across FGSM, PGD, CW, and DAmageNet, while remaining computationally lightweight. Theoretical analysis demonstrates linear separability in feature space, and attribution studies confirm frequency- and gradient-based features as primary drivers, supporting interpretability. The approach extends to new attack modalities, such as Visual Jailbreak, and offers practical deployment benefits for edge and embedded systems due to its model-agnostic, low-parameter design.

Abstract

Despite the remarkable performance of deep neural networks (DNNs) in image classification, their vulnerability to adversarial attacks remains a critical challenge. Most existing detection methods rely on complex, opaque architectures, which limits both their interpretability and their generalization. To address this, we propose FeatureLens, a lightweight framework that acts as a lens to scrutinize anomalies in image features. Comprising an Image Feature Extractor (IFE) and shallow classifiers (e.g., SVM, MLP, or XGBoost) with model sizes ranging from 1,000 to 30,000 parameters, FeatureLens achieves detection accuracy of 97.8% to 99.75% in closed-set evaluation and 86.17% to 99.6% in generalization evaluation across FGSM, PGD, CW, and DAmageNet attacks, using only a 51-dimensional feature vector. By combining strong detection performance with excellent generalization, interpretability, and computational efficiency, FeatureLens offers a practical pathway toward transparent and effective adversarial defense.
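
To make the idea of a hand-crafted image feature extractor concrete, the following is a minimal sketch, not the authors' IFE. The feature names echo those in the Figure 2 caption (MidFreqRatio, FreqEntropy, GradHist), but the formulas, the band edges `lo`/`hi`, the bin count, and the random sign noise used as a stand-in for an adversarial perturbation are all assumptions made for illustration. It assumes NumPy is available.

```python
import numpy as np


def mid_freq_ratio(img, lo=0.25, hi=0.75):
    """Share of spectral energy in a mid-frequency ring (band edges are assumed)."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    yy, xx = np.indices((h, w))
    # Radial distance from the spectrum centre, scaled so 1.0 is the axis edge.
    r = np.hypot((yy - h // 2) / (h / 2), (xx - w // 2) / (w / 2))
    band = (r >= lo) & (r < hi)
    return float(spec[band].sum() / (spec.sum() + 1e-12))


def freq_entropy(img):
    """Shannon entropy (in bits) of the normalised power spectrum."""
    spec = np.abs(np.fft.fft2(img)) ** 2
    p = spec.ravel() / (spec.sum() + 1e-12)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def grad_hist(img, bins=8):
    """Normalised histogram of finite-difference gradient magnitudes."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    hist, _ = np.histogram(mag, bins=bins, range=(0.0, mag.max() + 1e-12))
    return hist / hist.sum()


def extract_features(img, bins=8):
    """Concatenate the toy features into one vector (2 + bins dimensions)."""
    return np.concatenate(([mid_freq_ratio(img), freq_entropy(img)],
                           grad_hist(img, bins)))


# Smooth synthetic "clean" image and a perturbed copy (hypothetical data).
rng = np.random.default_rng(0)
n = 32
yy, xx = np.indices((n, n))
clean = 0.5 + 0.2 * (np.sin(2 * np.pi * xx / n) + np.sin(2 * np.pi * yy / n))
adv = clean + 0.03 * rng.choice([-1.0, 1.0], size=clean.shape)

f_clean, f_adv = extract_features(clean), extract_features(adv)
print("MidFreqRatio clean vs. perturbed:", f_clean[0], f_adv[0])
```

In this toy setting the perturbation moves energy into the mid-frequency band, so the first feature separates the two images. A shallow classifier such as an SVM or XGBoost model would then be trained on vectors like these; the real framework uses 51 features whose definitions are given in the paper, not here.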

Paper Structure

This paper contains 26 sections, 3 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Overview of the proposed FeatureLens framework. Clean and perturbed images (the latter generated by unknown attack methods such as FGSM, PGD, C&W, and DAmageNet) are processed through a 51-dimensional image feature extractor covering frequency, gradient, edge, texture, and MMD statistics. These image features are classified by a shallow model (SVM, MLP, or XGBoost) to determine whether the input is adversarial. The table on the right presents detection accuracies of XGBoost across cross-attack settings; the complete results are provided in Section 5.3.
  • Figure 2: XGBoost feature importance results based on the gain metric. This metric quantifies the average reduction in the loss function contributed by splits involving each feature, reflecting its overall influence in decision-making. The results show that frequency-domain features (e.g., MidFreqRatio, FreqEntropy) exhibit strong discriminative power across multi-dimensional representations, while gradient histogram features (e.g., GradHist_20) capture structural and textural information essential for distinguishing perturbation patterns. Their prominence suggests that the model primarily relies on frequency and gradient cues to identify subtle adversarial artifacts, underscoring the core role of structural information in adversarial detection.
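
The gain metric described in the Figure 2 caption can be illustrated with a small worked sketch. It uses the standard XGBoost split-gain expression, computed from the summed gradients and Hessians on each side of a split, and then averages the gain over all splits that use a given feature. The split statistics below are hypothetical numbers chosen for easy arithmetic, not values from the paper, and the helper names `split_gain` and `gain_importance` are illustrative.

```python
from collections import defaultdict


def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Loss reduction of one split from summed gradients (g) and Hessians (h)."""
    def score(g, h):
        return g * g / (h + lam)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def gain_importance(splits):
    """Average gain per feature over all splits that use that feature."""
    total, count = defaultdict(float), defaultdict(int)
    for feature, gain in splits:
        total[feature] += gain
        count[feature] += 1
    return {f: total[f] / count[f] for f in total}


# Hypothetical split records: (feature used, gain of that split).
splits = [
    ("MidFreqRatio", split_gain(-4.0, 4.0, 4.0, 4.0, lam=0.0)),  # gain 4.0
    ("MidFreqRatio", split_gain(-2.0, 2.0, 2.0, 2.0, lam=0.0)),  # gain 2.0
    ("GradHist_20", split_gain(-1.0, 2.0, 1.0, 2.0, lam=0.0)),   # gain 0.5
]
importance = gain_importance(splits)
print(importance)  # {'MidFreqRatio': 3.0, 'GradHist_20': 0.5}
```

With a trained XGBoost model, the same kind of per-feature average is what `Booster.get_score(importance_type="gain")` reports, so a ranking like the one in Figure 2 can be read directly from the fitted booster.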