Scaling Graph Convolutions for Mobile Vision

William Avery, Mustafa Munir, Radu Marculescu

TL;DR

The paper introduces Mobile Graph Convolution (MGC) to address the poor scaling of prior SVGA-based mobile vision graph models. By enforcing fixed, sparse connections and incorporating conditional positional encodings (CPE), MGC supports higher-resolution graph operations with minimal added latency, allowing the MobileViGv2 architecture to rival state-of-the-art CNN-ViT mobile models. Across ImageNet-1K, MS COCO, and ADE20K, MobileViGv2 achieves superior or competitive accuracy with favorable latency on mobile hardware, and ablations confirm the advantages of sparsity and CPE over dense graph configurations. The work shows that CNN-GNN hybrids can compete with, and in some cases outperform, traditional mobile architectures in both classification and downstream vision tasks, with practical implications for on-device AI.

Abstract

To compete with existing mobile architectures, MobileViG introduces Sparse Vision Graph Attention (SVGA), a fast token-mixing operator based on the principles of GNNs. However, MobileViG scales poorly with model size, falling at most 1% behind models with similar latency. This paper introduces Mobile Graph Convolution (MGC), a new vision graph neural network (ViG) module that solves this scaling problem. Our proposed mobile vision architecture, MobileViGv2, uses MGC to demonstrate the effectiveness of our approach. MGC improves on SVGA by increasing graph sparsity and introducing conditional positional encodings to the graph operation. Our smallest model, MobileViGv2-Ti, achieves a 77.7% top-1 accuracy on ImageNet-1K, 2% higher than MobileViG-Ti, with 0.9 ms inference latency on the iPhone 13 Mini NPU. Our largest model, MobileViGv2-B, achieves an 83.4% top-1 accuracy, 0.8% higher than MobileViG-B, with 2.7 ms inference latency. Besides image classification, we show that MobileViGv2 generalizes well to other tasks. For object detection and instance segmentation on MS COCO 2017, MobileViGv2-M outperforms MobileViG-M by 1.2 $AP^{box}$ and 0.7 $AP^{mask}$, and MobileViGv2-B outperforms MobileViG-B by 1.0 $AP^{box}$ and 0.7 $AP^{mask}$. For semantic segmentation on ADE20K, MobileViGv2-M achieves 42.9% $mIoU$ and MobileViGv2-B achieves 44.3% $mIoU$. Our code can be found at https://github.com/SLDGroup/MobileViGv2.

Paper Structure (12 sections, 4 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Latency versus top-1 accuracy (%) on ImageNet-1K for MobileViG and MobileViGv2. MobileViGv2 improves on MobileViG, shifting the accuracy-latency curve upward at comparable inference latencies.
  • Figure 2: The new MobileViGv2 architecture. The full architecture is shown on the left. The stem is composed of two stride-2 convolutions that downsample the input image by 4$\times$. Each downsampling block contains a single stride-2 convolution that downsamples its input by 2$\times$. (a) An inverted residual block using GELU activation. Stage 1 uses only inverted residuals; their number is controlled by $N_{1}$. (b) Stages 2-4 combine inverted residuals and MGCs. Each stage has $N_{i}$ inverted residuals followed by $M_{i}$ MGCs, where $i$ is the stage number. The CPE block is a conditional positional encoding (CPE) implemented with a 7$\times$7 depthwise convolution. The MRConv block contains graph construction and the max-relative message passing step. (c) Computing max-relative features using the graph construction of MGC. Given an input image, this module computes the max-relative score against a fixed set of shifted inputs: shifted right, left, up, and down by $k$. The outputs of this stage are the max-relative scores, which are concatenated to the input and passed through a 1$\times$1 convolution to complete message passing.
  • Figure 3: Mobile Graph Convolution (MGC) (left) versus Sparse Vision Graph Attention (SVGA) (right). Each grid is divided so that the effective receptive field equals that of MobileViGv2 at Stage 4. (left) The connections made for the green token using MGC ($L=2$) are shown in blue. (right) The connections made for the green token using SVGA ($K=2$), as used in MobileViG, are shown in blue. The image was obtained from the ImageNet-1K dataset and has been modified for this paper.
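
The max-relative step described in the Figure 2(c) caption can be illustrated with a short NumPy sketch. This is not the authors' implementation: the function names `max_relative` and `message_passing` are hypothetical, the shifts are assumed to wrap around the feature-map border (circular shift via `np.roll`), the difference is assumed to be neighbor minus center, and the 1$\times$1 convolution is written as a per-pixel linear map without bias, normalization, or activation.

```python
import numpy as np


def max_relative(x, k):
    """Max-relative features for a (C, H, W) feature map.

    Each position is compared with four fixed neighbours obtained by
    shifting the map right, left, down, and up by k positions.
    Assumptions: shifts wrap around the border, and the relative
    feature is (neighbour - centre).
    """
    shifts = [(k, 2), (-k, 2), (k, 1), (-k, 1)]  # (shift, axis)
    diffs = [np.roll(x, shift=s, axis=a) - x for s, a in shifts]
    # Elementwise maximum over the four relative features.
    return np.max(np.stack(diffs), axis=0)


def message_passing(x, weight, k):
    """Concatenate x with its max-relative features along the channel
    axis, then mix channels with a 1x1 convolution.

    weight has shape (C_out, 2 * C); a 1x1 convolution is the same
    linear map applied independently at every spatial position.
    """
    feats = np.concatenate([x, max_relative(x, k)], axis=0)  # (2C, H, W)
    return np.einsum("oc,chw->ohw", weight, feats)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 14, 14))   # illustrative sizes only
w = rng.standard_normal((8, 16))
y = message_passing(x, w, k=2)
print(y.shape)
```

Because every token connects to the same four shifted positions, the graph needs no nearest-neighbor search and no per-token index tensors: the whole operation reduces to a few array shifts, an elementwise maximum, and one pointwise convolution.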