SIFT-Graph: Benchmarking Multimodal Defense Against Image Adversarial Attacks With Robust Feature Graph

Jingjie He, Weijie Liang, Zihan Shan, Matthew Caesar

TL;DR

This paper tackles the vulnerability of modern vision models to adversarial perturbations by introducing SIFT-Graph, a multimodal defense that augments traditional backbones with a SIFT-based keypoint graph processed by a Graph Attention Network. By fusing robust local-structure embeddings with global semantic features from CNNs or Vision Transformers, the approach yields improved robustness under white-box PGD attacks across CIFAR-10, CIFAR-100, and Tiny ImageNet, with only minor losses in clean accuracy. The method is lightweight and plug-and-play, showing consistent gains across both transformer and CNN backbones without requiring adversarial training. While effective, the authors acknowledge information loss from SIFT and potential limitations against highly distorted attacks, and suggest future work to extend the framework to more tasks and architectures.

Abstract

Adversarial attacks expose a fundamental vulnerability in modern deep vision models by exploiting their dependence on dense, pixel-level representations that are highly sensitive to imperceptible perturbations. Traditional defense strategies typically operate within this fragile pixel domain and lack mechanisms to incorporate inherently robust visual features. In this work, we introduce SIFT-Graph, a multimodal defense framework that enhances the robustness of traditional vision models by aggregating structurally meaningful features extracted from raw images using both handcrafted and learned modalities. Specifically, we integrate Scale-Invariant Feature Transform (SIFT) keypoints with a Graph Attention Network to capture scale- and rotation-invariant local structures that are resilient to perturbations. These robust feature embeddings are then fused with traditional vision models, such as Vision Transformers and Convolutional Neural Networks, to form a unified, structure-aware, perturbation-resistant model. Preliminary results demonstrate that our method improves the robustness of vision models against gradient-based white-box adversarial attacks while incurring only a marginal drop in clean accuracy.
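
For readers unfamiliar with the gradient-based white-box attack mentioned above (PGD, per the TL;DR), the update rule is commonly written as below. This is the standard $L_\infty$ formulation, given here as an assumption; this summary does not state the paper's exact attack configuration. The symbols $f_\theta$ (classifier), $\mathcal{L}$ (loss), $\alpha$ (step size), and $\epsilon$ (perturbation budget, not to be confused with the Gaussian noise level in Figure 2) are illustrative choices, not taken from the paper.

$$x^{(t+1)} = \Pi_{\|x' - x\|_\infty \le \epsilon}\left( x^{(t)} + \alpha \cdot \mathrm{sign}\left( \nabla_{x^{(t)}} \mathcal{L}\left(f_\theta(x^{(t)}), y\right) \right) \right)$$

Here $\Pi$ projects the iterate back onto the $\epsilon$-ball around the clean image $x$, and $y$ is the true label. The attack is "white-box" because computing the gradient requires full access to the model's parameters.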

Paper Structure

This paper contains 27 sections, 14 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: (a) Overview of the central SIFT-Graph model components and workflow. (b) Detailed design of the graph encoder. Each node's attributes combine position coordinates (2), direction (1), response (1), and size (1); $n$ denotes the number of nodes.
  • Figure 2: Example of SIFT keypoints under Gaussian noise perturbations. Each $\epsilon$ denotes the noise intensity, expressed as a standard deviation on the 0–255 pixel scale.
  • Figure 3: Visualization of SIFT-based $k$-nearest neighbor graphs constructed with varying values of $k$.
  • Figure 4: Robustness evaluation under PGD attacks for ViT vs. SIFT-Graph-enhanced ViT on multiple datasets.
  • Figure 5: Robustness evaluation under PGD attacks for ResNet50 vs. SIFT-Graph-enhanced ResNet50 on multiple datasets.
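
The graph construction described in the captions for Figures 1 and 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `knn_edges`, the use of Euclidean distance on keypoint positions, and the sample keypoint values are all assumptions. Each node carries the five attributes listed in Figure 1 (x, y, direction, response, size), and each keypoint is linked to its $k$ nearest neighbours.

```python
import numpy as np


def knn_edges(positions: np.ndarray, k: int) -> np.ndarray:
    """Return a (2, n * k_eff) array of directed edges (source, target).

    Hypothetical helper: links every keypoint to its k nearest neighbours
    by Euclidean distance between (x, y) positions.
    """
    n = positions.shape[0]
    k_eff = min(k, n - 1)  # a node cannot have more than n - 1 neighbours
    if k_eff <= 0:
        return np.empty((2, 0), dtype=np.int64)

    # Pairwise Euclidean distances, shape (n, n).
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # exclude self-loops

    # Indices of the k_eff closest nodes for each row.
    neighbours = np.argsort(dist, axis=1)[:, :k_eff]
    sources = np.repeat(np.arange(n), k_eff)
    targets = neighbours.reshape(-1)
    return np.stack([sources, targets])


# Made-up keypoints for illustration; columns: x, y, direction, response, size.
nodes = np.array([
    [0.0, 0.0,  10.0, 0.08, 2.1],
    [1.0, 0.0,  95.0, 0.05, 1.8],
    [0.0, 2.0, 180.0, 0.07, 3.0],
    [5.0, 5.0, 270.0, 0.04, 2.4],
])

edges = knn_edges(nodes[:, :2], k=2)
print(nodes.shape)  # node feature matrix: (n, 5)
print(edges.shape)  # edge index: (2, n * k)
```

In practice the node attributes could be read from a SIFT detector's keypoints (for example, OpenCV's `KeyPoint` objects expose `pt`, `angle`, `response`, and `size`), and the resulting node matrix and edge index would be passed to a graph attention encoder. How the paper normalises these attributes or chooses $k$ is not stated in this summary.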