HGFormer: Topology-Aware Vision Transformer with HyperGraph Learning

Hao Wang, Shuo Zhang, Biao Leng

TL;DR

HGFormer introduces a topology-aware HyperGraph Transformer that integrates a Center Sampling K-Nearest Neighbors construction and topology-guided HyperGraph Attention to encode local groups and spatial topology within a vision transformer. By transforming image tokens into a hypergraph and performing node↔hyperedge↔node messaging, it achieves higher-order modeling while maintaining competitive complexity. Across ImageNet, COCO, ADE20K, pose estimation, and weakly supervised segmentation, HGFormer attains competitive or superior results compared to SoTA methods, with clear ablations validating the CS-KNN and HGA contributions. The approach demonstrates the practical value of perceptual organization in transformers, though it relies on carefully tuned hypergraph construction and hyperparameters for different tasks and resolutions.

Abstract

The computer vision community has witnessed an extensive exploration of vision transformers in the past two years. Drawing inspiration from traditional schemes, numerous works focus on introducing vision-specific inductive biases. However, the implicit modeling of permutation invariance and fully-connected interaction with individual tokens disrupts the regional context and spatial topology, further hindering higher-order modeling. This deviates from the principle of perceptual organization that emphasizes the local groups and overall topology of visual elements. Thus, we introduce the concept of hypergraph for perceptual exploration. Specifically, we propose a topology-aware vision transformer called HyperGraph Transformer (HGFormer). Firstly, we present a Center Sampling K-Nearest Neighbors (CS-KNN) algorithm for semantic guidance during hypergraph construction. Secondly, we present a topology-aware HyperGraph Attention (HGA) mechanism that integrates hypergraph topology as perceptual indications to guide the aggregation of global and unbiased information during hypergraph messaging. Using HGFormer as visual backbone, we develop an effective and unitive representation, achieving distinct and detailed scene depictions. Empirical experiments show that the proposed HGFormer achieves competitive performance compared to the recent SoTA counterparts on various visual benchmarks. Extensive ablation and visualization studies provide comprehensive explanations of our ideas and contributions.
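
As a rough illustration of the hypergraph construction step described above, the sketch below picks center tokens and groups each center with its K nearest tokens in feature space, so that every group becomes one hyperedge. The function name `cs_knn_hypergraph`, the regular-grid center sampling, and the squared Euclidean distance are assumptions made for illustration only; the abstract does not say how CS-KNN samples centers or measures similarity.

```python
import numpy as np


def cs_knn_hypergraph(tokens, grid_hw, stride, k):
    """Build a node-by-hyperedge incidence matrix from a token map.

    tokens:  (N, C) array of token features, N = H * W in row-major order.
    grid_hw: (H, W) shape of the token map.
    stride:  spacing between sampled centers (assumed regular-grid sampling).
    k:       number of nodes collected per hyperedge.
    """
    h, w = grid_hw
    # Assumption: centers lie on a regular grid over the token map.
    rows = np.arange(stride // 2, h, stride)
    cols = np.arange(stride // 2, w, stride)
    centers = (rows[:, None] * w + cols[None, :]).ravel()      # (N_e,)

    # Squared Euclidean distance from every center to every token.
    diff = tokens[centers][:, None, :] - tokens[None, :, :]    # (N_e, N, C)
    dist = (diff ** 2).sum(axis=-1)                            # (N_e, N)

    # The k closest tokens to each center form one hyperedge.
    nearest = np.argsort(dist, axis=1)[:, :k]                  # (N_e, k)

    # incidence[v, e] = 1 when node v belongs to hyperedge e.
    incidence = np.zeros((tokens.shape[0], centers.size), dtype=np.float32)
    incidence[nearest, np.arange(centers.size)[:, None]] = 1.0
    return incidence, centers


# Usage: an 8x8 token map with 16 channels, centers every 4 tokens, K = 9.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 16))
incidence, centers = cs_knn_hypergraph(tokens, (8, 8), stride=4, k=9)
print(incidence.shape, centers)
```

Because neighbors are chosen by feature distance and not by grid position, a hyperedge can span tokens that are far apart in the image, and a token can belong to several hyperedges or to none.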

Paper Structure

This paper contains 34 sections, 8 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Visualizations of feature maps from different methods on ImageNet. (a) Original image. (b) ViT. (c) ViHGNN. (d) HGFormer (Ours). ViT blends the foreground with the background ambiguously. ViHGNN distinguishes between foreground and background, but its portrayal of objects is rough and unclear. HGFormer (Ours) significantly highlights the foreground and suppresses the background, achieving a detailed depiction of objects. Zoom in for better view.
  • Figure 2: Illustration of the hypergraph concept. (a) Moving beyond a grid or sequence, the input feature map is transformed into a hypergraph by the proposed CS-KNN algorithm, where the stars denote sampling centers and the nodes connected by them signify hyperedges with semantic dependencies. (b) Hyperedge tokens are generated by aggregating their relevant nodes within the hyperedge local topologies, during which higher-order semantics are explored and irrelevant noise is potentially eliminated. (c) Node tokens are updated by aggregating their relevant hyperedges within the node local topologies, during which messages across all nodes are enhanced and propagated. As messaging functions, the proposed topology-aware HGA incorporates the topology perception of HGConv as perceptual indications and the global understanding of Transformer for contextual refinement. Zoom in for better view.
  • Figure 3: Framework of the HGFormer block. (a) Based on the proposed CS-KNN algorithm, input images are transformed into a hypergraph, where the stars denote sampling centers and the nodes connected by them signify hyperedges with semantic dependencies, visualized by the same color. (b) The HGFormer block is built on the node-hyperedge-node messaging mechanism, enabling representation learning through high-order relational reasoning. (c) As messaging functions, the proposed topology-aware HGA incorporates the topology perception of HGConv as perceptual indications and the global understanding of Transformer for contextual refinement. $K$ is the number of nodes within a hyperedge local topology, $K_e$ is the number of hyperedges within a node local topology, $N$ is the total number of nodes, and $N_e$ is the total number of hyperedges. Zoom in for better view.
  • Figure 4: Architecture of the HGFormer network with four stages. Following [he2016deep, wang2021pyramid], the HGFormer network is constructed as a 4-stage pyramid architecture. Each stage $i$ consists of an embedding module and $N_i$ HGFormer blocks. Within an HGFormer block, the input token map is transformed into a hypergraph, where nodes with dependencies are assigned to the same hyperedge, as visualized in the upper part. The HGFormer block is configured with the node-hyperedge-node messaging mechanism for relational reasoning. Zoom in for better view.
  • Figure 5: Performance comparisons between the proposed method and the recent SoTA models [hassani2023neighborhood, liu2022convnet, liu2021swin]. Our work outperforms these Transformer and ConvNet counterparts with similar parameters and computation.
  • ...and 7 more figures
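
The node-hyperedge-node messaging in Figure 2 (b) and (c) can be sketched with an incidence matrix. The example below is a minimal sketch that uses plain mean aggregation in both directions; it is not the paper's HGA, which additionally uses attention guided by the hypergraph topology. The function name `node_hyperedge_node` and the choice to leave unassigned nodes unchanged are assumptions for illustration.

```python
import numpy as np


def node_hyperedge_node(x, incidence):
    """One round of node -> hyperedge -> node mean aggregation.

    x:         (N, C) node features.
    incidence: (N, N_e) binary matrix; incidence[v, e] = 1 if node v is in hyperedge e.
    """
    edge_degree = incidence.sum(axis=0, keepdims=True)   # (1, N_e) nodes per hyperedge
    node_degree = incidence.sum(axis=1, keepdims=True)   # (N, 1) hyperedges per node

    # Node -> hyperedge: each hyperedge token is the mean of its member nodes.
    edge_tokens = (incidence / np.maximum(edge_degree, 1)).T @ x       # (N_e, C)

    # Hyperedge -> node: each node averages the hyperedges it belongs to.
    updated = (incidence / np.maximum(node_degree, 1)) @ edge_tokens   # (N, C)

    # Assumption: nodes that belong to no hyperedge keep their input features.
    return np.where(node_degree > 0, updated, x), edge_tokens


# Usage: 4 nodes, 2 hyperedges; node 1 is shared, node 3 is unassigned.
incidence = np.array([[1, 0],
                      [1, 1],
                      [0, 1],
                      [0, 0]], dtype=np.float64)
x = np.array([[0.0], [2.0], [4.0], [7.0]])
new_x, edge_tokens = node_hyperedge_node(x, incidence)
print(edge_tokens.ravel())  # [1. 3.]
print(new_x.ravel())        # [1. 2. 3. 7.]
```

Node 1 sits in both hyperedges, so it receives information from nodes 0 and 2 even though none of the three is directly connected to another; this is the higher-order grouping effect that the hyperedge step is meant to provide.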