Enhancing Graph Transformers with Hierarchical Distance Structural Encoding

Yuankai Luo, Hongkang Li, Lei Shi, Xiao-Ming Wu

TL;DR

The paper introduces Hierarchical Distance Structural Encoding (HDSE), which injects multi-level graph hierarchy information into graph transformer attention to address the lack of hierarchical bias in existing methods. HDSE defines the Graph Hierarchy Distance (GHD) and encodes node-pair distances across multiple coarsening levels, giving a learned attention bias that is more expressive than shortest-path-only encodings. HDSE plugs into standard graph transformers and scales to large graphs through a high-level HDSE variant paired with linear attention, yielding performance gains on graph-level tasks and on node classification for graphs with up to a billion nodes while remaining efficient. Theoretically, HDSE is strictly more expressive than the shortest path distance (SPD) within the GD-WL framework; empirically, results on 18 benchmarks show consistent improvements across different coarsening strategies. The approach is relevant to molecules, social networks, and other large-scale graphs where hierarchical structure matters for accurate prediction and scalable inference.
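
To make "node-pair distances across multiple coarsening levels" concrete, here is a minimal sketch. It assumes that the distance at each level is the shortest-path distance between the clusters containing the two nodes in the coarsened graph, and that level 0 is the original graph. The function names (`bfs_distances`, `coarsen`, `hierarchy_distances`) and the hand-written cluster assignment are hypothetical and are not taken from the paper's code; the paper uses graph coarsening algorithms to produce the assignments.

```python
from collections import deque


def bfs_distances(adj, source):
    """Unweighted shortest-path distances from `source` (unreachable nodes are omitted)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def coarsen(adj, cluster_of):
    """Merge each cluster into one super-node; link two super-nodes if any edge crosses them."""
    coarse = {c: set() for c in set(cluster_of.values())}
    for u, neighbours in adj.items():
        for v in neighbours:
            cu, cv = cluster_of[u], cluster_of[v]
            if cu != cv:
                coarse[cu].add(cv)
                coarse[cv].add(cu)
    return coarse


def hierarchy_distances(adj, assignments, u, v):
    """Return [d_0, ..., d_K]: the distance between the images of u and v at every level.

    `assignments[k]` maps each node of level k to its cluster at level k + 1
    (an assumed input format; any coarsening algorithm could produce it).
    """
    graph, a, b = adj, u, v
    distances = [bfs_distances(graph, a).get(b)]  # level 0: plain shortest path
    for cluster_of in assignments:
        graph = coarsen(graph, cluster_of)
        a, b = cluster_of[a], cluster_of[b]
        distances.append(bfs_distances(graph, a).get(b))
    return distances


# Toy example: a 6-node path 0-1-2-3-4-5 grouped into three clusters of two nodes.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
clusters = {0: "A", 1: "A", 2: "B", 3: "B", 4: "C", 5: "C"}

print(hierarchy_distances(path_graph, [clusters], 0, 5))  # [5, 2]
print(hierarchy_distances(path_graph, [clusters], 0, 1))  # [1, 0]
```

In this toy case, nodes 0 and 1 are one hop apart but share a cluster, so their level-1 distance is 0; this is the kind of community information a shortest-path distance alone does not expose.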

Abstract

Graph transformers need strong inductive biases to derive meaningful attention scores. Yet, current methods often fall short in capturing longer ranges, hierarchical structures, or community structures, which are common in various graphs such as molecules, social networks, and citation networks. This paper presents a Hierarchical Distance Structural Encoding (HDSE) method to model node distances in a graph, focusing on its multi-level, hierarchical nature. We introduce a novel framework to seamlessly integrate HDSE into the attention mechanism of existing graph transformers, allowing for simultaneous application with other positional encodings. To apply graph transformers with HDSE to large-scale graphs, we further propose a high-level HDSE that effectively biases the linear transformers towards graph hierarchies. We theoretically prove the superiority of HDSE over shortest path distances in terms of expressivity and generalization. Empirically, we demonstrate that graph transformers with HDSE excel in graph classification, regression on 7 graph-level datasets, and node classification on 11 large-scale graphs, including those with up to a billion nodes.
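
The integration into attention can be pictured as adding a distance-dependent bias to the attention logits before the softmax. The sketch below is a deliberate simplification for a single query node: it assumes one scalar lookup table per hierarchy level, summed into a bias, whereas the paper's actual parameterization may differ. The names `hdse_bias`, `attention_weights`, and `tables` are hypothetical, and all node pairs are assumed to be connected (integer distances).

```python
import math


def hdse_bias(distance_vectors, tables):
    """Sum one scalar per hierarchy level for every candidate node j.

    distance_vectors[j] is [d_0, ..., d_K] for the pair (i, j);
    tables[k][d] is a (hypothetical) learned scalar for distance d at level k.
    Distances beyond the table length are clipped to the last entry.
    """
    biases = []
    for vec in distance_vectors:
        total = 0.0
        for level, d in enumerate(vec):
            table = tables[level]
            total += table[min(d, len(table) - 1)]
        biases.append(total)
    return biases


def attention_weights(scores, bias):
    """Softmax over (content score + structural bias), computed stably."""
    logits = [s + b for s, b in zip(scores, bias)]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


# Three candidate nodes with identical content scores but different hierarchy distances.
scores = [0.0, 0.0, 0.0]
distance_vectors = [[1, 0], [1, 1], [5, 2]]
tables = [
    [0.0] * 6,         # level 0: no preference in this toy setup
    [1.0, 0.0, -1.0],  # level 1: favour nodes in the same cluster
]

weights = attention_weights(scores, hdse_bias(distance_vectors, tables))
print(weights)
```

With equal content scores, the node in the same cluster as the query receives the largest weight and the node two clusters away the smallest, which is the hierarchical bias the abstract describes. Because the bias is simply added to the logits, it can coexist with other positional encodings applied to the node features.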

Paper Structure

This paper contains 30 sections, 8 theorems, 16 equations, 4 figures, and 20 tables.

Key Result

Proposition 1

GD-WL with HDSE $(\mathrm{D}_{i,j})$ is strictly more expressive than GD-WL with the shortest path distance $\mathrm{SPD}(i, j)$.
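
One way to read the proposition is to spell out the notation. The display below is a sketch under stated assumptions: that $\mathrm{D}_{i,j}$ stacks the hierarchy distances from level 0 to level $K$, and that level 0 is the original, uncoarsened graph. The symbols $\mathrm{GHD}^{k}$ and $K$ are assumed notation for this illustration.

$$
\mathrm{D}_{i,j} = \left[\mathrm{GHD}^{0}(v_i, v_j),\ \mathrm{GHD}^{1}(v_i, v_j),\ \dots,\ \mathrm{GHD}^{K}(v_i, v_j)\right],
\qquad
\mathrm{GHD}^{0}(v_i, v_j) = \mathrm{SPD}(i, j).
$$

Under these assumptions, $\mathrm{SPD}(i, j)$ is one coordinate of $\mathrm{D}_{i,j}$, so GD-WL with HDSE separates every pair of graphs that GD-WL with SPD separates. Strictness then needs only one pair of graphs that HDSE separates and SPD does not, such as the Dodecahedron and Desargues graphs named in Figure 3.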

Figures (4)

  • Figure 1: Overview of our proposed hierarchical distance structural encoding (HDSE) and its integration with graph transformers. HDSE uses the graph hierarchy distance (GHD, see Definition 1), which can capture interpretable patterns in graph-structured data by using diverse graph coarsening algorithms. Darker colors indicate longer distances.
  • Figure 2: Examples of graph coarsening results and hierarchy distances. Left: HDSE can capture chemical motifs such as CF3 and aromatic rings on molecule graphs. Right: HDSE can distinguish the Dodecahedral and Desargues graphs. The Dodecahedral graph has 1-level hierarchy distances of length 2 (indicated by the dark color), while the Desargues graph does not. In contrast, the GD-WL test with SPD cannot distinguish these graphs (Zhang et al., 2023).
  • Figure 3: GD-WL with HDSE can distinguish the Dodecahedral and Desargues graphs, but GD-WL with SPD cannot.
  • Figure 4: Visualization of attention weights for the transformer attention and HDSE attention. The left side illustrates the graph coarsening result. The center column displays the attention weights of a sample node learned by the classic GT (Dwivedi and Bresson, 2020), while the right column shows the attention weights learned by the HDSE attention.

Theorems & Definitions (13)

  • Definition 1: Graph Hierarchy Distance
  • Proposition 1: Expressiveness of HDSE
  • Proposition 2
  • Corollary 1: Expressiveness of Graph Transformers with HDSE
  • Proposition 3: Generalization of Graph Transformers with HDSE
  • Proposition 4
  • proof
  • Proposition 5
  • proof
  • Proposition 6
  • ...and 3 more