GreedyViG: Dynamic Axial Graph Construction for Efficient Vision GNNs

Mustafa Munir, William Avery, Md Mostafijur Rahman, Radu Marculescu

TL;DR

GreedyViG introduces Dynamic Axial Graph Construction (DAGC) to replace expensive KNN-based graph construction in Vision GNNs, enabling dynamic yet efficient connectivity within images. The method is paired with a CNN-GNN backbone, GreedyViG, which alternates MBConv local processing with DAGC global message passing and uses Conditional Positional Encoding to incorporate spatial context. Across ImageNet-1K, COCO, and ADE20K, GreedyViG achieves state-of-the-art or competitive accuracy with substantially lower GMACs and parameters compared to ViG, ViHGNN, and MobileViG, demonstrating the viability of a dynamic, axial graph approach for efficient vision backbones. The work highlights that combining local CNN processing with dynamic graph convolution can outperform current SOTA while maintaining efficiency, suggesting a strong practical impact for real-world vision systems.

Abstract

Vision graph neural networks (ViG) offer a new avenue for exploration in computer vision. A major bottleneck in ViGs is the inefficient k-nearest neighbor (KNN) operation used for graph construction. To solve this issue, we propose a new method for designing ViGs, Dynamic Axial Graph Construction (DAGC), which is more efficient than KNN as it limits the number of considered graph connections made within an image. Additionally, we propose a novel CNN-GNN architecture, GreedyViG, which uses DAGC. Extensive experiments show that GreedyViG beats existing ViG, CNN, and ViT architectures in terms of accuracy, GMACs, and parameters on image classification, object detection, instance segmentation, and semantic segmentation tasks. Our smallest model, GreedyViG-S, achieves 81.1% top-1 accuracy on ImageNet-1K, 2.9% higher than Vision GNN and 2.2% higher than Vision HyperGraph Neural Network (ViHGNN), with fewer GMACs and a similar number of parameters. Our largest model, GreedyViG-B, obtains 83.9% top-1 accuracy, 0.2% higher than Vision GNN, with a 66.6% decrease in parameters and a 69% decrease in GMACs. GreedyViG-B also obtains the same accuracy as ViHGNN with a 67.3% decrease in parameters and a 71.3% decrease in GMACs. Our work shows that hybrid CNN-GNN architectures not only provide a new avenue for designing efficient models, but that they can also exceed the performance of current state-of-the-art models.
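
To make the efficiency argument concrete, the sketch below counts how many candidate neighbors a single patch is compared against under exhaustive KNN and under an axis-restricted scheme. The stride `k`, the wrap-around behaviour, and both function names are illustrative assumptions; none of them is taken from the paper.

```python
def knn_candidates(height, width):
    """Candidates per patch when KNN compares against every other patch."""
    return height * width - 1


def axial_candidates(height, width, k):
    """Candidates per patch when only every k-th patch along the patch's
    own row and column is considered (hypothetical stride k, with
    wrap-around, excluding the patch itself)."""
    along_row = (width - 1) // k      # offsets k, 2k, ... smaller than width
    along_column = (height - 1) // k  # offsets k, 2k, ... smaller than height
    return along_row + along_column


# An 8x8 patch grid, the size used in the paper's graph-construction figure.
print(knn_candidates(8, 8))       # 63
print(axial_candidates(8, 8, 2))  # 6
```

The gap widens with resolution, because the exhaustive count grows with the product of height and width while the axial count grows with their sum.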

Paper Structure

This paper contains 16 sections, 2 equations, 5 figures, 7 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Comparison of model size and performance (top-1 accuracy on ImageNet-1K). GreedyViG achieves the highest performance compared to other state-of-the-art models.
  • Figure 2: DAGC and SVGA graph construction. a) SVGA graph construction for the green patch of an 8$\times$8 image. All red patches are connected to the green patch regardless of similarity. b) DAGC for the green patch of an 8$\times$8 image. DAGC dynamically constructs a graph along the axes by applying a mask (the blue patches) so that only patches that are similar in terms of Euclidean distance are connected. The red patches are not connected to the green patch because they are not part of the mask.
  • Figure 3: Euclidean distance calculation between the original image and the image with its quadrants flipped along the diagonal.
  • Figure 4: GreedyViG architecture. (a) Network architecture showing the stages and blocks. (b) The Conv Stem. (c) MBConv Block. (d) Downsample. (e) DAGC Block. (f) Dynamic Grapher. (g) FFN.
  • Figure 5: Comparison of model size and performance (mIoU on ADE20K). GreedyViG achieves the highest performance on all model sizes compared to other state-of-the-art models. a) shows performance compared to parameters and b) shows performance compared to GMACs.
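
The captions for Figures 2 and 3 can be read together as a masking procedure. The following is a minimal pure-Python sketch of that reading, not the authors' implementation. It makes three assumptions that the captions do not confirm: "quadrants flipped along the diagonal" is interpreted as swapping diagonally opposite quadrants (a cyclic shift by half the height and half the width), the similarity threshold is taken as the mean minus the standard deviation of the resulting distances, and axial candidates are sampled with a hypothetical stride `k` with wrap-around. All function names are illustrative.

```python
import math
from statistics import mean, pstdev


def shifted(grid, dr, dc):
    """Return the patch grid cyclically shifted by dr rows and dc columns."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r - dr) % h][(c - dc) % w] for c in range(w)]
            for r in range(h)]


def estimate_threshold(grid):
    """Estimate a distance threshold from the grid and its quadrant-swapped
    copy. ASSUMPTION: threshold = mean - standard deviation."""
    h, w = len(grid), len(grid[0])
    swapped = shifted(grid, h // 2, w // 2)
    dists = [math.dist(grid[r][c], swapped[r][c])
             for r in range(h) for c in range(w)]
    return mean(dists) - pstdev(dists)


def axial_edges(grid, k):
    """Connect each patch to the axial candidates (every k-th patch in its
    row and column) whose Euclidean distance is below the threshold."""
    h, w = len(grid), len(grid[0])
    threshold = estimate_threshold(grid)
    offsets = ([(dr, 0) for dr in range(k, h, k)]
               + [(0, dc) for dc in range(k, w, k)])
    edges = set()
    for dr, dc in offsets:
        for r in range(h):
            for c in range(w):
                nr, nc = (r - dr) % h, (c - dc) % w
                # The mask: keep the connection only for similar patches.
                if math.dist(grid[r][c], grid[nr][nc]) < threshold:
                    edges.add(((r, c), (nr, nc)))
    return edges


# Toy 4x4 grid of 1-D features: left half is 0.0, right half is 10.0.
toy = [[(0.0,) if c < 2 else (10.0,) for c in range(4)] for r in range(4)]
edges = axial_edges(toy, k=1)
print(((0, 1), (0, 0)) in edges)  # True: both patches are in the left half
print(((0, 0), (0, 3)) in edges)  # False: the patches are in different halves
```

In this sketch the static, SVGA-style graph corresponds to keeping every candidate in `offsets`; the distance test is the only step that makes the connectivity depend on the input.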