PVG: Progressive Vision Graph for Vision Recognition
Jiafu Wu, Jian Li, Jiangning Zhang, Boshen Zhang, Mingmin Chi, Yabiao Wang, Chengjie Wang
TL;DR
The paper addresses the challenge of capturing irregular visual objects with graph-based backbones, where existing Vision GNNs suffer from inaccurate neighbor selection and over-smoothing in deep layers. It proposes the Progressive Vision Graph (PVG), which integrates three components: Progressively Separated Graph Construction (PSGC) to encode second-order similarity without extra cost, a max-pooling and mathematical-expectation module (MaxE) for efficient neighbor aggregation, and the Graph error Linear Unit (GraphLU) to maintain feature diversity in deep networks. Together, these yield state-of-the-art performance on ImageNet-1K and COCO with fewer parameters than prior graph-based backbones. The work demonstrates PVG's effectiveness as a general vision backbone and highlights its potential to broaden the role of GNNs in computer vision for more flexible, non-Euclidean feature representations.
Abstract
Convolution-based and Transformer-based vision backbone networks process images into grid and sequence structures, respectively, which are inflexible for capturing irregular objects. Although Vision GNN (ViG) adopts graph-level features for complex images, it suffers from inaccurate neighbor node selection, expensive computation when aggregating node information, and over-smoothing in the deep layers. To address these problems, we propose a Progressive Vision Graph (PVG) architecture for vision recognition tasks. Compared with previous works, PVG contains three main components: 1) Progressively Separated Graph Construction (PSGC), which introduces second-order similarity by gradually increasing the channels of the global graph branch and decreasing the channels of the local branch as the layers deepen; 2) a neighbor-node information aggregation and update module that uses Max pooling and mathematical Expectation (MaxE) to aggregate rich neighbor information; 3) Graph error Linear Unit (GraphLU), which enhances low-value information in a relaxed form to reduce the compression of image detail information and thereby alleviate over-smoothing. Extensive experiments on mainstream benchmarks demonstrate the superiority of PVG over state-of-the-art methods. For example, our PVG-S obtains 83.0% Top-1 accuracy on ImageNet-1K, surpassing the GNN-based ViG-S by +0.9 with 18.5% fewer parameters, while the largest model, PVG-B, obtains 84.2%, a +0.5 improvement over ViG-B. Furthermore, our PVG-S obtains gains of +1.3 box AP and +0.4 mask AP over ViG-S on the COCO dataset.
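To make the first two components more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear channel schedule for the global and local branches (the abstract only states that the global share grows and the local share shrinks with depth) and reads MaxE as a concatenation of the channel-wise maximum and mean over a node's neighbors. The function names `progressive_split` and `max_mean_aggregate`, and the 1/4 to 3/4 endpoints, are hypothetical.

```python
from typing import List, Sequence, Tuple


def progressive_split(total_channels: int, layer: int, num_layers: int) -> Tuple[int, int]:
    """Return (global_channels, local_channels) for the given layer index.

    Assumption: a linear schedule. The abstract only says that the global
    branch gains channels and the local branch loses them as depth grows.
    """
    if num_layers < 2:
        raise ValueError("num_layers must be at least 2")
    if not 0 <= layer < num_layers:
        raise ValueError("layer out of range")
    # Hypothetical endpoints: the global share rises from 1/4 to 3/4.
    share = 0.25 + 0.5 * layer / (num_layers - 1)
    global_ch = int(round(total_channels * share))
    return global_ch, total_channels - global_ch


def max_mean_aggregate(neighbors: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the channel-wise max and mean of a node's neighbor features.

    Assumption: this is one plausible reading of "Max pooling and
    mathematical Expectation"; the exact formulation is not given here.
    """
    if not neighbors:
        raise ValueError("need at least one neighbor")
    columns = list(zip(*neighbors))  # one tuple of values per channel
    maxima = [max(col) for col in columns]
    means = [sum(col) / len(col) for col in columns]
    return maxima + means


if __name__ == "__main__":
    for depth in range(4):
        print(depth, progressive_split(64, depth, 4))
    print(max_mean_aggregate([[1.0, 4.0], [3.0, 2.0]]))
```

Under these assumptions, the global branch receives more channels at every successive layer while the total stays fixed, and the aggregated feature has twice as many channels as each neighbor feature.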
