SIFT-Graph: Benchmarking Multimodal Defense Against Image Adversarial Attacks With Robust Feature Graph
Jingjie He, Weijie Liang, Zihan Shan, Matthew Caesar
TL;DR
This paper tackles the vulnerability of modern vision models to adversarial perturbations by introducing SIFT-Graph, a multimodal defense that augments traditional backbones with a SIFT-based keypoint graph processed by a Graph Attention Network. By fusing robust local-structure embeddings with global semantic features from CNNs or Vision Transformers, the approach yields improved robustness under white-box PGD attacks across CIFAR-10, CIFAR-100, and Tiny ImageNet, with only minor losses in clean accuracy. The method is lightweight and plug-and-play, showing consistent gains across both transformer and CNN backbones without requiring adversarial training. While effective, the authors acknowledge information loss from SIFT and potential limitations against highly distorted attacks, and suggest future work to extend the framework to more tasks and architectures.
Abstract
Adversarial attacks expose a fundamental vulnerability in modern deep vision models by exploiting their dependence on dense, pixel-level representations that are highly sensitive to imperceptible perturbations. Traditional defense strategies typically operate within this fragile pixel domain and lack mechanisms to incorporate inherently robust visual features. In this work, we introduce SIFT-Graph, a multimodal defense framework that enhances the robustness of traditional vision models by aggregating structurally meaningful features extracted from raw images using both handcrafted and learned modalities. Specifically, we integrate Scale-Invariant Feature Transform (SIFT) keypoints with a Graph Attention Network (GAT) to capture scale- and rotation-invariant local structures that are resilient to perturbations. These robust feature embeddings are then fused with a traditional vision model, such as a Vision Transformer (ViT) or a Convolutional Neural Network (CNN), to form a unified, structure-aware, perturbation-resistant model. Preliminary results demonstrate that our method improves the robustness of vision models against gradient-based white-box adversarial attacks while incurring only a marginal drop in clean accuracy.
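To make the data flow described in the abstract concrete, the following is a minimal, dependency-free sketch of the three stages: building a graph over keypoints, aggregating descriptors with attention, and fusing the result with a backbone embedding. It is an illustration under stated assumptions, not the authors' implementation. The k-nearest-neighbour graph construction, the parameter-free dot-product attention (a real GAT layer uses learned projections), the mean pooling, the concatenation fusion, and all function names are hypothetical choices made for this example; the keypoints and descriptors are toy values standing in for real SIFT output.

```python
import math


def knn_edges(points, k):
    """Connect each keypoint to its k nearest neighbours (assumed graph rule)."""
    edges = {}
    for i, (xi, yi) in enumerate(points):
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        edges[i] = [j for _, j in dists[:k]]
    return edges


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_layer(features, edges):
    """Simplified stand-in for a GAT layer (no learned weights).

    Each node becomes a softmax-weighted average of itself and its
    neighbours, with weights given by dot-product similarity.
    """
    out = []
    for i, h_i in enumerate(features):
        nbrs = [i] + edges[i]
        scores = [sum(a * b for a, b in zip(h_i, features[j])) for j in nbrs]
        weights = softmax(scores)
        out.append([
            sum(w * features[j][d] for w, j in zip(weights, nbrs))
            for d in range(len(h_i))
        ])
    return out


def mean_pool(features):
    """Collapse node features into one graph-level embedding."""
    n = len(features)
    return [sum(f[d] for f in features) / n for d in range(len(features[0]))]


def fuse(graph_embedding, backbone_embedding):
    """Concatenation fusion (an assumption; other fusion schemes are possible)."""
    return list(backbone_embedding) + list(graph_embedding)


# Toy stand-ins for SIFT keypoint locations and (very short) descriptors.
keypoints = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
descriptors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

edges = knn_edges(keypoints, k=2)
graph_embedding = mean_pool(attention_layer(descriptors, edges))
# Hypothetical global feature vector from a CNN or ViT backbone.
fused = fuse(graph_embedding, backbone_embedding=[0.2, 0.4, 0.6])
```

In a realistic setting the keypoints and 128-dimensional descriptors would come from a SIFT detector, the attention layer would carry trainable parameters, and the fused vector would feed a classification head.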
