SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers

Shravan Venkatraman; Jaskaran Singh Walia; Joe Dhanith P R

TL;DR

This work tackles the challenge of embedding multi-scale representations into Vision Transformers by proposing SAG-ViT, a framework that patches CNN-derived feature maps, constructs a k-connected graph of patches with similarity-based edges, refines patch features with a Graph Attention Network (GAT), and encodes them with a Transformer. The approach combines the strengths of CNNs for local multi-scale features, graph attention for structured relational modeling, and Transformers for global context, and it is validated on six diverse datasets. Key contributions include high-fidelity feature map patching, scale-aware graph construction, and the integration of a GAT with Transformer encoders, each of which contributes to improved accuracy and efficiency. The empirical results show strong performance gains and favorable hardware utilization, indicating practical value for scalable, domain-adaptive image classification.

Abstract

Vision Transformers (ViTs) have redefined image classification by leveraging self-attention to capture complex patterns and long-range dependencies between image patches. However, a key challenge for ViTs is efficiently incorporating multi-scale feature representations, which are inherent in convolutional neural networks (CNNs) through their hierarchical structure. Graph transformers have made strides in addressing this through graph-based modeling, but they often lose or insufficiently represent spatial hierarchies, especially since redundant or less relevant areas dilute the image's contextual representation. To bridge this gap, we propose SAG-ViT, a Scale-Aware Graph Attention ViT that integrates the multi-scale feature capabilities of CNNs, the representational power of ViTs, and graph-attended patching to enable richer contextual representation. Using EfficientNetV2 as a backbone, the model extracts multi-scale feature maps and divides them into patches, which preserves richer semantic information than directly patching the input images. The patches are structured into a graph using spatial and feature similarities, and a Graph Attention Network (GAT) refines the node embeddings. This refined graph representation is then processed by a Transformer encoder, capturing long-range dependencies and complex interactions. We evaluate SAG-ViT on benchmark datasets across various domains, validating its effectiveness in advancing image classification tasks. Our code and weights are available at https://github.com/shravan-18/SAG-ViT.
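To make the patch-and-graph step concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a feature map of shape (C, H, W), non-overlapping square patches, edges between 8-connected grid neighbors, and cosine similarity as the edge weight. The function names `patch_feature_map` and `build_patch_graph`, the patch size, and the neighborhood choice are all hypothetical and chosen only for illustration; the GAT and Transformer stages are omitted.

```python
import numpy as np


def patch_feature_map(fmap, patch):
    """Split a (C, H, W) feature map into flattened non-overlapping patches.

    Hypothetical helper: returns an array of shape (gh * gw, C * patch * patch)
    and the patch grid size (gh, gw).
    """
    c, h, w = fmap.shape
    gh, gw = h // patch, w // patch
    # Drop any rows/columns that do not fill a whole patch.
    fmap = fmap[:, : gh * patch, : gw * patch]
    # (C, gh, p, gw, p) -> (gh, gw, C, p, p) -> (gh * gw, C * p * p)
    blocks = fmap.reshape(c, gh, patch, gw, patch).transpose(1, 3, 0, 2, 4)
    return blocks.reshape(gh * gw, c * patch * patch), (gh, gw)


def build_patch_graph(patches, grid):
    """Build a weighted adjacency matrix over patches.

    Assumption for illustration: nodes are linked to their 8-connected grid
    neighbors, and each edge weight is the cosine similarity of the two
    patch vectors.
    """
    gh, gw = grid
    n = gh * gw
    norms = np.linalg.norm(patches, axis=1, keepdims=True)
    unit = patches / (norms + 1e-8)
    adj = np.zeros((n, n))
    for r in range(gh):
        for col in range(gw):
            i = r * gw + col
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, col + dc
                    is_self = dr == 0 and dc == 0
                    if not is_self and 0 <= rr < gh and 0 <= cc < gw:
                        j = rr * gw + cc
                        adj[i, j] = unit[i] @ unit[j]
    return adj


# Toy stand-in for a CNN feature map: 8 channels on a 6x6 spatial grid.
rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 6, 6))
patches, grid = patch_feature_map(fmap, patch=2)
adj = build_patch_graph(patches, grid)
```

In a full pipeline, `patches` would serve as the node features and `adj` as the graph structure passed to a graph attention layer, whose refined node embeddings would then be fed to a Transformer encoder as a token sequence.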
