ClusterViG: Efficient Globally Aware Vision GNNs via Image Partitioning

Dhruv Parikh, Jacob Fein-Ashley, Tian Ye, Rajgopal Kannan, Viktor Prasanna

TL;DR

This work tackles the high computational cost and limited flexibility of Vision Graph Neural Networks (ViGs) by introducing Dynamic Efficient Graph Convolution (DEGC), a partitioned, parallelizable graph-construction method. Building on DEGC, ClusterViG combines local graph learning with global inter-partition interactions to achieve globally aware, efficient ViG backbones. Empirical results across ImageNet classification and COCO detection/segmentation show up to 5x faster end-to-end inference and state-of-the-art performance at comparable parameter counts, with the ability to train on higher-resolution images. The approach yields an isotropic, scalable ViG backbone suitable for real-time CV tasks and edge deployments, with future work focusing on partition strategies, G-GCN variants, parameter tuning, and hardware acceleration.

Abstract

Convolutional Neural Networks (CNN) and Vision Transformers (ViT) have dominated the field of Computer Vision (CV). Graph Neural Networks (GNN) have performed remarkably well across diverse domains because they can represent complex relationships via unstructured graphs. However, the applicability of GNNs for visual tasks was unexplored till the introduction of Vision GNNs (ViG). Despite the success of ViGs, their performance is severely bottlenecked due to the expensive $k$-Nearest Neighbors ($k$-NN) based graph construction. Recent works addressing this bottleneck impose constraints on the flexibility of GNNs to build unstructured graphs, undermining their core advantage while introducing additional inefficiencies. To address these issues, in this paper, we propose a novel method called Dynamic Efficient Graph Convolution (DEGC) for designing efficient and globally aware ViGs. DEGC partitions the input image and constructs graphs in parallel for each partition, improving graph construction efficiency. Further, DEGC integrates local intra-graph and global inter-graph feature learning, enabling enhanced global context awareness. Using DEGC as a building block, we propose a novel CNN-GNN architecture, ClusterViG, for CV tasks. Extensive experiments indicate that ClusterViG reduces end-to-end inference latency for vision tasks by up to $5\times$ when compared against a suite of models such as ViG, ViHGNN, PVG, and GreedyViG, with a similar model parameter count. Additionally, ClusterViG reaches state-of-the-art performance on image classification, object detection, and instance segmentation tasks, demonstrating the effectiveness of the proposed globally aware learning strategy. Finally, input partitioning performed by DEGC enables ClusterViG to be trained efficiently on higher-resolution images, underscoring the scalability of our approach.
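
As a rough illustration of why partitioning can reduce graph-construction cost, consider a brute-force $k$-NN that computes all pairwise distances. The sketch below assumes $N$ patches split into $\kappa$ equal-sized partitions; this balance is an assumption made for the arithmetic, since clustering need not produce equal partitions, and the values $N = 196$ and $\kappa = 4$ are hypothetical.

$$
\underbrace{N^2}_{\text{global } k\text{-NN}}
\;\longrightarrow\;
\underbrace{\kappa \left(\frac{N}{\kappa}\right)^{2} = \frac{N^2}{\kappa}}_{\text{per-partition } k\text{-NN}},
\qquad
\text{e.g. } 196^2 = 38416
\;\longrightarrow\;
4 \cdot 49^2 = 9604 .
$$

Under these assumptions the number of pairwise distance evaluations drops by a factor of $\kappa$, and the $\kappa$ smaller problems are independent, so they can run in parallel.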

Paper Structure

This paper contains 16 sections, 6 equations, 5 figures, 6 tables, 2 algorithms.

Figures (5)

  • Figure 1: Comparing methods. a) CNN treats an image as a grid of patches; b) ViT treats an image as a sequence of patches and computes a dense attention matrix that compares each patch (blue) to all other patches (red); c) ViG connects a patch to similar patches, via $k$-NN; d) ViHGNN uses a hypergraph to represent image patches (a hyperedge contains several patches/nodes, shown as colored boxes); e) GreedyViG uses SVGA to connect a patch to its spatial horizontal and vertical axis patches and prunes connections (light blue); f) ClusterViG partitions the patches for fast graph construction and performs local-global feature learning (global features shown via star).
  • Figure 2: Dynamic Efficient Graph Convolution module. DEGC performs four main operations: (i) Input Partitioning. $\bm{X}$ is partitioned into $\kappa$ partitions via clustering. (ii) Graph Construction. A $k$-NN graph is constructed in each partition, in parallel across all partitions. (iii) Global Update. Each partition's global feature vector, $\bm{z}_i$, is updated via GATv2 through global inter-partition interactions, yielding $\bm{z}_i'$. (iv) Local Update. Each partition updates its patch features by incorporating $\bm{z}_i'$, via intra-partition interactions, in parallel, yielding $\bm{X}'$, which is reshaped to $\bm{I}'$. Note that dark borders indicate updated features.
  • Figure 3: Optimized DEGC module. LU refers to the $\text{local\_update()}$ function, and the dark borders indicate updated patch features.
  • Figure 4: ClusterViG architecture, comprising a pre-processing Stem, a backbone of Grapher and FFN blocks repeated $n_b$ times, and a post-processing Head.
  • Figure 5: Comparing the inference performance of models in three groups.
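
The first two DEGC operations listed in the Figure 2 caption (input partitioning and per-partition $k$-NN graph construction) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `partitioned_knn` and `partition_summaries` are hypothetical, partitions are formed by a plain reshape into equal-sized contiguous groups as a stand-in for clustering, and mean pooling is assumed as one possible way to obtain a per-partition vector $\bm{z}_i$. The GATv2 global update and the local update are not shown.

```python
import numpy as np


def partitioned_knn(X, kappa, k):
    """Build a k-NN graph inside each partition, batched over partitions.

    X     : (N, D) array of patch features.
    kappa : number of partitions (assumed to divide N evenly).
    k     : neighbors per patch (the patch itself is a candidate).

    Returns P of shape (kappa, N // kappa, D) and neighbor indices of
    shape (kappa, N // kappa, k), local to each partition.
    """
    N, D = X.shape
    if N % kappa != 0:
        raise ValueError("this sketch assumes equal-sized partitions")

    # (i) Input partitioning. ASSUMPTION: contiguous equal-sized groups
    # via reshape, standing in for the clustering used in the paper.
    P = X.reshape(kappa, N // kappa, D)

    # (ii) Graph construction. Squared Euclidean distances are computed
    # only within each partition: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (P ** 2).sum(axis=-1)                       # (kappa, M)
    gram = P @ P.transpose(0, 2, 1)                  # (kappa, M, M)
    d2 = sq[:, :, None] + sq[:, None, :] - 2.0 * gram

    # Keep the k smallest distances per patch (self has distance ~0).
    idx = np.argsort(d2, axis=-1)[:, :, :k]
    return P, idx


def partition_summaries(P):
    """ASSUMPTION: mean pooling as a simple per-partition vector z_i."""
    return P.mean(axis=1)                            # (kappa, D)
```

For example, with 64 patches of dimension 8, `partitioned_knn(X, kappa=4, k=3)` returns neighbor indices of shape `(4, 16, 3)`, each index referring to a patch within the same partition. Because the distance matrices are batched along the first axis, all partitions are handled by a single set of array operations, which mirrors the "in parallel across all partitions" wording of the caption.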