
AttentionViG: Cross-Attention-Based Dynamic Neighbor Aggregation in Vision GNNs

Hakan Emre Gedik, Andrew Martin, Mustafa Munir, Oguzhan Baser, Radu Marculescu, Sandeep P. Chinchali, Alan C. Bovik

TL;DR

This paper introduces a cross-attention-based neighbor aggregation method for Vision Graph Neural Networks, in which node-derived queries attend to neighbor keys to produce adaptive, non-local message passing. The proposed Grapher layer combines this aggregation with FFNs and conditional positional encoding, forming AttentionViG, a multiscale CNN–GNN backbone that relies on SVGA graph construction for efficiency. Across ImageNet-1K, COCO, and ADE20K, AttentionViG achieves state-of-the-art or competitive results at lower computational cost than many baselines, and visualizations indicate that the model learns semantically meaningful neighbor weighting. The approach tolerates imperfect graph construction and could be extended to video, point clouds, and other structured data domains.

Abstract

Vision Graph Neural Networks (ViGs) have demonstrated promising performance in image recognition tasks against Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). An essential part of the ViG framework is the node-neighbor feature aggregation method. Although various graph convolution methods, such as Max-Relative, EdgeConv, GIN, and GraphSAGE, have been explored, a versatile aggregation method that effectively captures complex node-neighbor relationships without requiring architecture-specific refinements is needed. To address this gap, we propose a cross-attention-based aggregation method in which the query projections come from the node, while the key projections come from its neighbors. Additionally, we introduce a novel architecture called AttentionViG that uses the proposed cross-attention aggregation scheme to conduct non-local message passing. We evaluated the image recognition performance of AttentionViG on the ImageNet-1K benchmark, where it achieved SOTA performance. Additionally, we assessed its transferability to downstream tasks, including object detection and instance segmentation on MS COCO 2017, as well as semantic segmentation on ADE20K. Our results demonstrate that the proposed method not only achieves strong performance, but also maintains efficiency, delivering competitive accuracy with comparable FLOPs to prior vision GNN architectures.
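
The aggregation described in the abstract can be illustrated with a minimal NumPy sketch for a single node. This is not the authors' implementation: the projection matrices `w_q` and `w_k`, the scaled dot-product score, the softmax normalization, and the use of raw neighbor features as values are all assumptions made for illustration, since the abstract only states that queries come from the node and keys from its neighbors.

```python
import numpy as np

def cross_attention_aggregate(x, neighbors, w_q, w_k):
    """Aggregate neighbor features for one node with attention weights.

    x:         (d,)     feature vector of the node
    neighbors: (n, d)   feature vectors of its n neighbors
    w_q, w_k:  (d, d_k) hypothetical query / key projection matrices
    """
    q = x @ w_q                  # query projected from the node, shape (d_k,)
    k = neighbors @ w_k          # keys projected from the neighbors, shape (n, d_k)

    # Scaled dot-product similarity between the node and each neighbor
    # (the scaling and softmax are assumptions, not taken from the paper).
    scores = k @ q / np.sqrt(q.shape[0])
    scores = scores - scores.max()          # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()

    # Weighted sum of neighbor features; alpha[j] is the weight of neighbor j.
    aggregated = alpha @ neighbors
    return aggregated, alpha
```

Under these assumptions, `alpha` plays the role of the per-neighbor weights: a neighbor whose key aligns more closely with the node's query contributes more to the aggregated feature.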

Paper Structure

This paper contains 17 sections, 11 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Cross-attention assigns weights to neighbors, with bolder edges representing higher weights. For the node $\mathbf{x}_i^l$ and its neighbors $\mathbf{y}_{i, 0}^l, \mathbf{y}_{i, 1}^l, \mathbf{y}_{i, 2}^l, \mathbf{y}_{i, 3}^l$, the corresponding neighbor weights are $\alpha_{i, 0}^l, \alpha_{i, 1}^l, \alpha_{i, 2}^l, \alpha_{i, 3}^l$.
  • Figure 2: Cross-attention-based feature aggregation extracts query vectors from the nodes and key vectors from the neighbors, enabling the model to learn the relative importance of each neighbor to a given node.
  • Figure 3: The overall architecture of AttentionViG consists of a stem, inverted residual blocks (IRB) for feature extraction, Grapher layers for graph-based feature aggregation, and downsampling blocks for multi-scale representation learning.
  • Figure 4: (a) Convolutional stem for input image embeddings, where convolutional layers have a stride of 2. (b) Grapher layer with conditional positional encoding (CPE) and the proposed cross-attention aggregation. (c) FFN layer, a component of the Grapher. (d) Downsampling block with a convolutional layer of stride 2. (e) Inverted residual block as introduced in MobileNetV2.
  • Figure 5: SVGA graph construction policy from MobileViG. The central patch (red) represents the node, while surrounding patches (blue) are its neighbors, assigned in a criss-cross pattern.
  • ...and 2 more figures
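
The criss-cross neighbor pattern described for Figure 5 can be sketched in a few lines of Python. This is a hedged illustration only: the function name `svga_neighbors`, the `stride` parameter, and the choice to stop at the grid boundary rather than wrap around are assumptions, not details given in this summary.

```python
def svga_neighbors(row, col, height, width, stride=2):
    """Return criss-cross neighbors of the patch at (row, col).

    Neighbors share the node's row or column and sit at multiples of
    `stride` away from it. `stride` and the no-wrap boundary handling
    are illustrative assumptions.
    """
    neighbors = []
    # Patches in the same column, stepping along the rows.
    for r in range(row % stride, height, stride):
        if r != row:
            neighbors.append((r, col))
    # Patches in the same row, stepping along the columns.
    for c in range(col % stride, width, stride):
        if c != col:
            neighbors.append((row, c))
    return neighbors
```

For example, on a 6×6 patch grid with `stride=2`, the patch at (2, 2) is connected to (0, 2), (4, 2), (2, 0), and (2, 4). Because the pattern is fixed, no nearest-neighbor search over features is needed to build the graph.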