DVHGNN: Multi-Scale Dilated Vision HGNN for Efficient Vision Recognition

Caoshuo Li, Tanzhe Li, Xiaobin Hu, Donghao Luo, Taisong Jin

TL;DR

DVHGNN introduces a multi-scale dilated vision hypergraph neural network to efficiently model high-order relations in images while reducing computation relative to prior graph-based backbones. By combining clustering-based hyperedges with dilated hypergraph construction and a two-stage dynamic hypergraph convolution, the approach captures both local and long-range dependencies through adaptive edge weights and cosine similarity. The method achieves state-of-the-art or competitive performance across ImageNet classification, COCO detection/segmentation, and ADE20K segmentation, notably surpassing ViG and ViHGNN baselines with lower FLOPs. This work demonstrates the value of learnable hypergraph structures for flexible, scalable vision backbones with strong representational power.

Abstract

Recently, the Vision Graph Neural Network (ViG) has gained considerable attention in computer vision. Despite its groundbreaking innovation, ViG faces two key issues: the quadratic computational complexity of its K-Nearest Neighbor (KNN) graph construction, and the restriction of ordinary graphs to pairwise relations. To address these challenges, we propose a novel vision architecture, termed Dilated Vision HyperGraph Neural Network (DVHGNN), which leverages multi-scale hypergraphs to efficiently capture high-order correlations among objects. Specifically, the proposed method tailors Clustering and Dilated HyperGraph Construction (DHGC) to adaptively capture multi-scale dependencies among the data samples. Furthermore, a dynamic hypergraph convolution mechanism is proposed to facilitate adaptive feature exchange and fusion at the hypergraph level. Extensive qualitative and quantitative evaluations on benchmark image datasets demonstrate that the proposed DVHGNN significantly outperforms state-of-the-art vision backbones. For instance, our DVHGNN-S achieves a top-1 accuracy of 83.1% on ImageNet-1K, surpassing ViG-S by +1.0% and ViHGNN-S by +0.6%.

Paper Structure

This paper contains 20 sections, 7 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Comparison of FLOPs and Top-1 accuracy on ImageNet-1K. The proposed DVHGNN achieves the best performance compared to other state-of-the-art models.
  • Figure 2: Architecture of the proposed DVHGNN. Each Multi-Scale (MS) DVHGNN block constructs multi-scale hyperedges, passes messages through vertex and hyperedge convolutions, and ends with a ConvFFN that enhances feature transformation capacity and counteracts over-smoothing.
  • Figure 3: Illustration of Multi-Scale Hypergraph Construction (without region partition). The final hyperedge set is composed of two types of hyperedges: a set of size $C$ obtained from cosine similarity clustering, and a set of size $R$ derived from DHGC. Each hyperedge corresponds to a hyperedge centroid, marked with a pentagon in the diagram. By default, $R = 3$, with the distinct dilated hyperedges corresponding to a kernel size of $3 \times 3$ with dilation rates $r = 1$, $2$, and $3$, respectively, resulting in receptive field sizes of $3 \times 3$, $5 \times 5$, and $7 \times 7$.
  • Figure 4: Illustration of the two-stage message passing of our Dynamic Hypergraph Convolution (DHConv). $\textbf{h}_{c}$ is the feature of the hyperedge centroid, $\mathbf{S}_{ie}$ is the cosine similarity matrix between vertices and hyperedge centroids, and $\textbf{x}_{i}$ and $\textbf{x}'_{i}$ represent the vertex feature before and after DHConv. The message flow to vertex 2 is marked in red.
  • Figure 5: Visualization of the hypergraph structure of DVHGNN. The hypergraph structure is obtained by an overlay of hyperedges derived from the Clustering method and DHGC.
  • ...and 1 more figure
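
The receptive field sizes quoted in the Figure 3 caption follow from the standard dilated-kernel relation: a kernel of size $k$ with dilation rate $r$ spans a window of side $k + (k - 1)(r - 1)$. The sketch below checks that arithmetic for the default setting ($k = 3$, $r = 1, 2, 3$). It is a minimal illustration, not the authors' implementation: the helper names `dilated_offsets` and `receptive_field` are hypothetical, and it assumes each dilated hyperedge gathers the grid positions that a dilated $3 \times 3$ kernel centered on the hyperedge centroid would sample.

```python
def dilated_offsets(kernel_size: int, dilation: int) -> list[tuple[int, int]]:
    """Offsets (dy, dx), relative to the centroid, sampled by a dilated square kernel.

    Hypothetical helper: assumes an odd kernel size centered on the centroid.
    """
    half = kernel_size // 2
    steps = [dilation * k for k in range(-half, half + 1)]
    return [(dy, dx) for dy in steps for dx in steps]


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Side length of the window spanned by a dilated kernel: k + (k - 1)(r - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Default setting from the Figure 3 caption: 3 x 3 kernel, dilation rates 1, 2, 3.
for r in (1, 2, 3):
    offsets = dilated_offsets(3, r)
    side = receptive_field(3, r)
    print(f"r={r}: {len(offsets)} sampled positions, window {side}x{side}")
```

Under these assumptions every dilation rate samples the same number of positions (nine) while the covered window grows from $3 \times 3$ to $7 \times 7$, which is how dilation widens the spatial extent of a hyperedge without adding vertices to it.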