SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers

Shravan Venkatraman, Jaskaran Singh Walia, Joe Dhanith P R

TL;DR

This work tackles the challenge of embedding multi-scale representations into Vision Transformers by proposing SAG-ViT, a framework that patches CNN-derived feature maps, constructs a k-connected graph of patches with similarity-based edges, refines patch features with a Graph Attention Network, and encodes them with a Transformer. The approach combines the strengths of CNNs for local multi-scale features, graph attention for structured relational modeling, and Transformers for global context, and it is validated on six diverse datasets. Key contributions include high-fidelity feature map patching, a scale-aware graph construction, and the integration of GAT with Transformer encoders, each of which contributes to improved accuracy and efficiency. The empirical results show strong performance gains and favorable hardware utilization, indicating practical impact for scalable, domain-adaptive image classification.
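The first two stages of the pipeline (patching a feature map, then linking patches by similarity) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `patchify_feature_map` and `knn_similarity_graph`, the non-overlapping tiling, and the use of cosine-similarity nearest neighbours as the edge rule are all assumptions made for illustration, and the random array stands in for a real CNN feature map.

```python
import numpy as np


def patchify_feature_map(fmap, patch):
    """Split a (C, H, W) feature map into non-overlapping patch x patch tiles.

    Returns an array of shape (num_patches, C * patch * patch), ordered
    row-major over the patch grid. (Hypothetical helper, not from the paper.)
    """
    c, h, w = fmap.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    tiles = fmap.reshape(c, gh, patch, gw, patch)
    tiles = tiles.transpose(1, 3, 0, 2, 4)  # (gh, gw, C, patch, patch)
    return tiles.reshape(gh * gw, c * patch * patch)


def knn_similarity_graph(nodes, k):
    """Link each node to its k most cosine-similar nodes, excluding itself.

    Assumption: feature-only cosine similarity is used here as a stand-in
    for the similarity-based edge rule. Returns a symmetric boolean
    adjacency matrix of shape (N, N).
    """
    n = nodes.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of nodes")
    norms = np.linalg.norm(nodes, axis=1, keepdims=True)
    unit = nodes / np.maximum(norms, 1e-12)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # never pick a node as its own neighbour
    nbrs = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), k), nbrs.ravel()] = True
    return adj | adj.T  # symmetrize so edges are undirected


# Toy stand-in for a backbone feature map: 8 channels on a 6x6 grid.
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((8, 6, 6))
patch_nodes = patchify_feature_map(feature_map, patch=2)  # 9 nodes, 32 dims
adjacency = knn_similarity_graph(patch_nodes, k=3)
```

Symmetrizing the adjacency means a node can end up with more than k neighbours; a directed variant would simply skip the final `adj | adj.T` step.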

Abstract

Vision Transformers (ViTs) have redefined image classification by leveraging self-attention to capture complex patterns and long-range dependencies between image patches. However, a key challenge for ViTs is efficiently incorporating multi-scale feature representations, which are inherent in convolutional neural networks (CNNs) through their hierarchical structure. Graph transformers have made strides in addressing this by leveraging graph-based modeling, but they often lose or insufficiently represent spatial hierarchies, especially since redundant or less relevant areas dilute the image's contextual representation. To bridge this gap, we propose SAG-ViT, a Scale-Aware Graph Attention ViT that integrates the multi-scale feature capabilities of CNNs, the representational power of ViTs, and graph-attended patching to enable richer contextual representation. Using EfficientNetV2 as a backbone, the model extracts multi-scale feature maps and divides them into patches, which preserves richer semantic information than directly patching the input images. The patches are structured into a graph using spatial and feature similarities, and a Graph Attention Network (GAT) refines the node embeddings. This refined graph representation is then processed by a Transformer encoder, capturing long-range dependencies and complex interactions. We evaluate SAG-ViT on benchmark datasets across various domains, validating its effectiveness in advancing image classification tasks. Our code and weights are available at https://github.com/shravan-18/SAG-ViT.
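To make the GAT refinement step concrete, the following is a minimal single-head graph attention layer in NumPy, written from the standard GAT formulation (attention logits from a LeakyReLU over a learned vector applied to concatenated projected features, followed by a softmax over each node's neighbours). It is a sketch under stated assumptions rather than the released SAG-ViT code: the function name `gat_layer`, the added self-loops, the negative slope of 0.2, and the random weights are illustrative choices.

```python
import numpy as np


def gat_layer(h, adj, W, a, negative_slope=0.2):
    """Single-head graph attention over node features (illustrative sketch).

    h:   (N, F) node features, e.g. flattened feature-map patches
    adj: (N, N) boolean adjacency matrix
    W:   (F, F_out) projection matrix
    a:   (2 * F_out,) attention vector
    Returns (refined features of shape (N, F_out), attention matrix (N, N)).
    """
    z = h @ W
    f_out = z.shape[1]
    # a^T [z_i || z_j] splits into a source term plus a neighbour term.
    logits = (z @ a[:f_out])[:, None] + (z @ a[f_out:])[None, :]
    logits = np.where(logits > 0, logits, negative_slope * logits)  # LeakyReLU
    # Assumption: each node also attends to itself (self-loops added).
    mask = adj | np.eye(h.shape[0], dtype=bool)
    logits = np.where(mask, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    weights = np.exp(logits)
    alpha = weights / weights.sum(axis=1, keepdims=True)
    return alpha @ z, alpha


# Toy graph: 5 nodes with 4-dim features and two undirected edges.
rng = np.random.default_rng(0)
node_features = rng.standard_normal((5, 4))
adjacency = np.zeros((5, 5), dtype=bool)
adjacency[0, 1] = adjacency[1, 0] = True
adjacency[2, 3] = adjacency[3, 2] = True
refined, attention = gat_layer(
    node_features,
    adjacency,
    W=rng.standard_normal((4, 3)),
    a=rng.standard_normal(6),
)
```

In a full pipeline the refined node embeddings would then be passed as a token sequence to a Transformer encoder; that stage is omitted here.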

Paper Structure

This paper contains 12 sections, 7 equations, 8 figures, 9 tables, and 2 algorithms.

Figures (8)

  • Figure 1: Resource usage and computational complexity comparison across different methods. CNN names refer to a [CNN$\rightarrow$ViT+GAT] architecture. For comparison, ViT-S and ViT-L denote standard Vision Transformer models without the CNN or GAT components.
  • Figure 2: Visualization of the patch generation and graph construction pipeline in SAG-ViT.
  • Figure 3: An illustration of our proposed SAG-ViT architecture for learning scale-aware, high-fidelity features with graph attention.
  • Figure 4: Feature embeddings projected into a 2D space using UMAP for DeiT, Vanilla ViT, and SAG-ViT (Proposed).
  • Figure 5: Token-token correlation matrix comparison of SAG-ViT with DeiT and vanilla ViT.
  • ...and 3 more figures