PVG: Progressive Vision Graph for Vision Recognition

Jiafu Wu, Jian Li, Jiangning Zhang, Boshen Zhang, Mingmin Chi, Yabiao Wang, Chengjie Wang

TL;DR

The paper addresses the challenge of capturing irregular visual objects with graph-based backbones, where existing Vision GNNs suffer from inaccurate neighbor selection and over-smoothing in deep layers. It proposes Progressive Vision Graph (PVG), integrating three components: PSGC to encode second-order similarity without extra cost, MaxE for efficient neighbor aggregation, and GraphLU to maintain feature diversity in deep networks. Together, these yield state-of-the-art performance on ImageNet-1K and COCO with reduced parameters compared to prior graph-based backbones. The work demonstrates PVG's effectiveness as a general vision backbone and highlights its potential to broaden the role of GNNs in computer vision for more flexible, non-Euclidean feature representations.

Abstract

Convolution-based and Transformer-based vision backbones process images as grid or sequence structures, respectively, which are inflexible for capturing irregular objects. Although Vision GNN (ViG) adopts graph-level features for complex images, it suffers from inaccurate neighbor node selection, expensive node information aggregation, and over-smoothing in deep layers. To address these problems, we propose the Progressive Vision Graph (PVG) architecture for vision recognition tasks. PVG contains three main components: 1) Progressively Separated Graph Construction (PSGC), which introduces second-order similarity by gradually increasing the channels of the global graph branch and decreasing the channels of the local branch as the layers deepen; 2) a neighbor node information aggregation and update module based on Max pooling and mathematical Expectation (MaxE), which aggregates rich neighbor information; 3) the Graph error Linear Unit (GraphLU), which enhances low-value information in a relaxed form to reduce the compression of image detail information and thereby alleviate over-smoothing. Extensive experiments on mainstream benchmarks demonstrate the superiority of PVG over state-of-the-art methods: PVG-S obtains 83.0% Top-1 accuracy on ImageNet-1K, surpassing the GNN-based ViG-S by +0.9 with 18.5% fewer parameters, while the largest model, PVG-B, obtains 84.2%, a +0.5 improvement over ViG-B. Furthermore, PVG-S obtains gains of +1.3 box AP and +0.4 mask AP over ViG-S on the COCO dataset.
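
The abstract describes MaxE only as combining max pooling with a mathematical expectation over neighbor nodes. The following is a minimal sketch of that idea, not the authors' implementation: the function name `maxe_aggregate`, the plain-list data layout, and the choice to concatenate the node's own features with the channel-wise max and mean of its neighbors are all assumptions made for illustration. Any learned projections or other details of the actual module are not covered here.

```python
from statistics import fmean


def maxe_aggregate(features, neighbors):
    """Illustrative max-plus-mean neighbor aggregation (hypothetical helper).

    features:  list of N node feature vectors, each a list of C floats.
    neighbors: list of N lists; neighbors[i] holds the indices of node i's
               neighbors (assumed non-empty for every node).
    Returns a list of N vectors of length 3 * C:
    [own features | channel-wise max over neighbors | channel-wise mean].
    """
    aggregated = []
    for i, x_i in enumerate(features):
        neighbor_feats = [features[j] for j in neighbors[i]]
        # zip(*neighbor_feats) yields one tuple per channel across neighbors.
        max_part = [max(channel) for channel in zip(*neighbor_feats)]
        mean_part = [fmean(channel) for channel in zip(*neighbor_feats)]
        aggregated.append(list(x_i) + max_part + mean_part)
    return aggregated


if __name__ == "__main__":
    feats = [[1.0, 0.0], [3.0, 2.0], [5.0, -4.0]]
    nbrs = [[1, 2], [0, 2], [0, 1]]
    for row in maxe_aggregate(feats, nbrs):
        print(row)
```

In this sketch the max term keeps the strongest response per channel while the mean term summarizes the neighborhood as a whole, which is one way to read "rich neighbor information" in the abstract.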

Paper Structure (17 sections, 2 theorems, 27 equations, 8 figures, 4 tables)

Key Result

theorem 1

First-order Similarity. The first-order similarity between nodes $i$ and $j$ is given by a distance metric $l$ evaluated on the pair, where $l$ may be, for example, Euclidean distance, cosine distance, or dot-product distance.
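
To make the definition concrete, here is a small sketch of first-order similarity used for neighbor selection. It assumes that each node is represented by a feature vector and that the metric is applied directly to the two vectors. The helper names (`euclidean`, `cosine_distance`, `knn_neighbors`) are hypothetical and are not taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.dist(a, b)


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return 1.0 - dot / norm


def knn_neighbors(features, k, metric=euclidean):
    """For each node, return the indices of its k closest other nodes
    under the chosen first-order distance metric."""
    neighbors = []
    for i, x_i in enumerate(features):
        others = [j for j in range(len(features)) if j != i]
        # Smaller distance means higher first-order similarity.
        others.sort(key=lambda j: metric(x_i, features[j]))
        neighbors.append(others[:k])
    return neighbors


if __name__ == "__main__":
    feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
    print(knn_neighbors(feats, k=1))
```

Swapping `metric=cosine_distance` changes which nodes count as neighbors without touching the selection logic, which is why the metric is left as a parameter in this sketch.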

Figures (8)

  • Figure 1: Comparison of accuracy vs. parameters and FLOPs between PVG and ViG. PVG achieves the best balance between accuracy and computation on ImageNet.
  • Figure 2: Our PVG architecture is designed in a cascaded four-stage manner, with each stage adopting the Progressively Separated Graph Construction (PSGC) to introduce second-order similarity, which transfers channels from local to global graphs between adjacent blocks. After graph construction, PVG uses our proposed MaxE in each block for information aggregation and update. Additionally, PVG utilizes a concise activation function GraphLU to enhance detail for alleviating the over-smoothing problem.
  • Figure 3: Illustration of First-order similarity and Second-order similarity.
  • Figure 4: Illustration of neighbor sampling in MaxE.
  • Figure 5: Illustration of GraphLU ($\varepsilon = 1$) for alleviating the over-smoothing problem of graph network.
  • ...and 3 more figures

Theorems & Definitions (2)

  • theorem 1
  • theorem 2