AdaptViG: Adaptive Vision GNN with Exponential Decay Gating

Mustafa Munir, Md Mostafijur Rahman, Radu Marculescu

TL;DR

AdaptViG addresses the high computational cost of graph construction in Vision Graph Neural Networks by introducing Adaptive Graph Convolution (AGC), which combines a static scaffold of local and logarithmic connections with a learnable Exponential Decay Gating that weights long-range edges by feature similarity. The four-stage backbone uses this efficient gating in the early, high-resolution stages and a Global Attention mixer in the final stage, where the low resolution makes full feature aggregation affordable. On ImageNet-1K and on downstream tasks (COCO, ADE20K), AdaptViG establishes a new state-of-the-art Pareto frontier: AdaptViG-M attains 82.6% top-1 accuracy on ImageNet-1K with 80% fewer parameters and 84% fewer GMACs than ViG-B, and its downstream metrics exceed those of the much larger EfficientFormer-L7 with 78% fewer parameters. These results indicate that a carefully designed hybrid Vision GNN can outperform leading CNN and ViT architectures in both accuracy and efficiency.

Abstract

Vision Graph Neural Networks (ViGs) offer a new direction for advancements in vision architectures. While powerful, ViGs often face substantial computational challenges stemming from their graph construction phase, which can hinder their efficiency. To address this issue, we propose AdaptViG, an efficient and powerful hybrid Vision GNN that introduces a novel graph construction mechanism called Adaptive Graph Convolution. This mechanism builds upon a highly efficient static axial scaffold and a dynamic, content-aware gating strategy called Exponential Decay Gating, which selectively weights long-range connections based on feature similarity. Furthermore, AdaptViG employs a hybrid strategy, utilizing our efficient gating mechanism in the early stages and a full Global Attention block in the final stage for maximum feature aggregation. Our method achieves a new state-of-the-art trade-off between accuracy and efficiency among Vision GNNs. For instance, our AdaptViG-M achieves 82.6% top-1 accuracy, outperforming ViG-B by 0.3% while using 80% fewer parameters and 84% fewer GMACs. On downstream tasks, AdaptViG-M obtains 45.8 mIoU, 44.8 $AP^{box}$, and 41.1 $AP^{mask}$, surpassing the much larger EfficientFormer-L7 by 0.7 mIoU, 2.2 $AP^{box}$, and 2.1 $AP^{mask}$, respectively, with 78% fewer parameters.
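
The gating idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the gate form `exp(-d / tau)` with an L1 feature distance `d`, the fixed temperature `tau` (learnable in a real model), the power-of-two offsets used as "logarithmic" connections, and the helper names `exp_decay_gate`, `log_offsets`, and `aggregate_row`. The sketch works along a single axis of a feature grid.

```python
import math


def exp_decay_gate(f_i, f_j, tau=1.0):
    """Gate in (0, 1]: equals 1 for identical features and decays as they diverge.

    Assumed form: exp(-L1 distance / tau). The paper's exact formula may differ.
    """
    dist = sum(abs(a - b) for a, b in zip(f_i, f_j))
    return math.exp(-dist / tau)


def log_offsets(length):
    """Hypothetical long-range offsets 2, 4, 8, ... strictly below `length`."""
    offsets, step = [], 2
    while step < length:
        offsets.append(step)
        step *= 2
    return offsets


def aggregate_row(feats, tau=1.0):
    """Sum neighbour features along one axis.

    Local neighbours (offset 1) are added with weight 1; neighbours at the
    static long-range offsets are added with the similarity-based gate.
    """
    n, dim = len(feats), len(feats[0])
    out = []
    for i in range(n):
        acc = [0.0] * dim
        for j in (i - 1, i + 1):  # static local connections
            if 0 <= j < n:
                acc = [a + b for a, b in zip(acc, feats[j])]
        for off in log_offsets(n):  # static long-range scaffold, gated by content
            for j in (i - off, i + off):
                if 0 <= j < n:
                    g = exp_decay_gate(feats[i], feats[j], tau)
                    acc = [a + g * b for a, b in zip(acc, feats[j])]
        out.append(acc)
    return out


if __name__ == "__main__":
    row = [[1.0], [1.0], [1.0], [5.0]]
    print(aggregate_row(row))
```

In this sketch the set of candidate edges is fixed in advance, so no nearest-neighbour search is needed; only the edge weights depend on the content. A dissimilar distant node, such as the last element of `row`, contributes almost nothing to nodes whose features differ strongly from it.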

Paper Structure

This paper contains 44 sections, 16 equations, 6 figures, 16 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison of model efficiency and performance on ImageNet-1K. AdaptViG establishes a new state-of-the-art Pareto frontier, achieving higher top-1 accuracy than competing models with (a) a lower number of parameters and (b) a lower number of GMACs.
  • Figure 2: Comparison of SVGA vs. our AGC. (a) SVGA establishes fixed, unweighted connections to nodes based on a fixed number of hops. (b) Our proposed AGC creates immediate local connections along with content-aware distant connections whose strength is weighted by feature similarity using Exponential Decay Gating, visualized by the intensity of the blue color.
  • Figure 3: The AdaptViG Architecture. (a) The overall 4-stage hierarchical architecture. Stages 1-3 use our AGC Block, while Stage 4 uses our attention block. (b) The initial convolutional stem. (c) The Inverted Residual Block (IRB) for local feature processing. (d) The downsample block used between stages. (e) The full AGC Block, which contains our AGC mixer and an FFN. (f) The Attention Block, which replaces the AGC mixer with the self-attention mechanism. (g) The Feed-Forward Network (FFN) used in both block types.
  • Figure 4: Downstream task performance vs. model size. AdaptViG establishes a new state-of-the-art Pareto frontier on all three downstream tasks, outperforming competing backbones. (a) Object detection performance ($AP^{box}$) on MS-COCO. (b) Instance segmentation performance ($AP^{mask}$) on MS-COCO. (c) Semantic segmentation performance (mIoU) on ADE20K.
  • Figure 5: Accuracy vs. throughput comparison on ImageNet-1K. We plot top-1 accuracy against inference throughput (images/sec). AdaptViG establishes a new state-of-the-art Pareto frontier, demonstrating significantly higher throughput and accuracy than competing models.
  • ...and 1 more figures
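
The Pareto-frontier claims in the figure captions above can be made concrete with a small sketch: a model is on the frontier if no other model is at least as cheap and at least as accurate while being strictly better on one of the two. The function name `pareto_front` and the model tuples are hypothetical; the numbers are illustrative only and are not results from the paper.

```python
def pareto_front(models):
    """Return names of models not dominated on (lower cost, higher accuracy)."""
    front = []
    for name, cost, acc in models:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for _, c, a in models
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (name, GMACs, top-1 %) tuples, made up for illustration.
models = [("A", 1.0, 78.0), ("B", 2.0, 80.0), ("C", 2.5, 79.0), ("D", 4.0, 82.0)]
print(pareto_front(models))  # ['A', 'B', 'D']
```

Here model C is excluded because B is both cheaper and more accurate. The same test applies when the cost axis is parameter count, or when throughput is used after negating it so that lower still means better.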