Cluster-wise Graph Transformer with Dual-granularity Kernelized Attention

Siyuan Huang, Yunchong Song, Jiayue Zhou, Zhouhan Lin

TL;DR

This work addresses the loss of node-level detail in cluster-based graph pooling by introducing Node-to-Cluster Attention (N2C-Attn), which fuses node- and cluster-level information via Multiple Kernel Learning (MKL). Rather than coarsening each cluster into a single embedding, Cluster-GT treats clusters as tokens and lets them interact through N2C-Attn, which uses kernelized softmax to reach linear time complexity and comes in two MKL variants: tensor-product and convex-sum kernels. The model combines a node-wise GNN, a simple Metis partitioner, and the N2C-Attn module. It achieves strong performance across eight graph-level datasets, and the learned kernel weights show that the emphasis between node-level and cluster-level kernels shifts with the domain. Overall, N2C-Attn provides a principled, efficient bridge between cluster- and node-level representations for hierarchical graph learning, and points to new directions in interaction strategies between clusters and nodes.

Abstract

In the realm of graph learning, there is a category of methods that conceptualize graphs as hierarchical structures, utilizing node clustering to capture broader structural information. While generally effective, these methods often rely on a fixed graph coarsening routine, leading to overly homogeneous cluster representations and loss of node-level information. In this paper, we envision the graph as a network of interconnected node sets without compressing each cluster into a single embedding. To enable effective information transfer among these node sets, we propose the Node-to-Cluster Attention (N2C-Attn) mechanism. N2C-Attn incorporates techniques from Multiple Kernel Learning into the kernelized attention framework, effectively capturing information at both node and cluster levels. We then devise an efficient form for N2C-Attn using the cluster-wise message-passing framework, achieving linear time complexity. We further analyze how N2C-Attn combines bi-level feature maps of queries and keys, demonstrating its capability to merge dual-granularity information. The resulting architecture, Cluster-wise Graph Transformer (Cluster-GT), which uses node clusters as tokens and employs our proposed N2C-Attn module, shows superior performance on various graph-level tasks. Code is available at https://github.com/LUMIA-Group/Cluster-wise-Graph-Transformer.

Paper Structure

This paper contains 45 sections, 1 theorem, 32 equations, 6 figures, 3 tables.

Key Result

Proposition 1

If $\kappa_{C}(Q_i, K_j) = \langle \phi(Q_i), \phi(K_j) \rangle$ and $\kappa_{N}(q_i, k_t) = \langle \psi(q_i), \psi(k_t) \rangle$, where $\phi$ and $\psi$ are feature maps for the respective kernels, then the Node-to-Cluster Attention with the tensor product kernel implies the following equivalent where $\otimes$ represents the outer product of the node-level and cluster-level feature maps. Conv
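
The displayed equation of the proposition is not reproduced in the text above. As a hedged reminder of the kind of identity involved, the standard fact that a product of two kernels is the inner product of the tensor products of their feature maps can be written in the proposition's notation as follows. This is the textbook identity, not necessarily the paper's exact statement, and it omits any normalization the attention may apply:

$$
\kappa_{C}(Q_i, K_j)\,\kappa_{N}(q_i, k_t)
= \langle \phi(Q_i), \phi(K_j) \rangle \, \langle \psi(q_i), \psi(k_t) \rangle
= \langle \psi(q_i) \otimes \phi(Q_i),\; \psi(k_t) \otimes \phi(K_j) \rangle .
$$

The second equality holds because $\langle a \otimes b, c \otimes d \rangle = \langle a, c \rangle \langle b, d \rangle$ for vectors $a, c$ in one space and $b, d$ in another.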

Figures (6)

  • Figure 1: Definition of Node-to-Cluster Attention (N2C-Attn). N2C-Attn perceives the graph as interconnected node sets instead of coarsening each cluster into a single node. It integrates multiple kernel learning methods into the kernelized attention framework to facilitate message propagation among node clusters, simultaneously capturing both the node-level and cluster-level information.
  • Figure 2: An efficient implementation of N2C-Attn-T with the message-passing framework. $|\mathcal{N^P}|$ denotes the number of clusters and $|\mathcal{E^P}|$ denotes the number of edges between clusters. The computation can be decomposed into 4 steps: 1) aggregation of node-level keys and values within each cluster, 2) computation of gate on each edge with the cluster-level kernel, 3) message propagation among clusters, 4) dot product of aggregated value with the node-level query of each cluster.
  • Figure 2: Comparison with Graph Transformers on ZINC and MolHIV over 4 runs with 4 different seeds. The best results are highlighted. Values missing from the literature are indicated as '-'.
  • Figure 3: Architecture of Cluster-wise Graph Transformer (Cluster-GT), which can be decomposed into three main modules: 1) a node-wise convolution module with GNN, 2) a graph partition module with Metis, and 3) a cluster-wise interaction module with N2C-Attn.
  • Figure 4: Visualization of $\alpha$ (weight of the cluster-level kernel) during the training process. N2C-Attn learns to pay more attention to cluster-level information in social networks than in bioinformatics.
  • ...and 1 more figure
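
The four steps listed in the caption for the efficient N2C-Attn-T implementation can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the authors' code: the node-level feature map `psi` (elu + 1), the exponential cluster-level kernel, the normalization by the summed weights, and all function and variable names are hypothetical choices made for the example. A naive double loop over cluster edges and nodes is included only to check that the factored form gives the same result.

```python
import numpy as np

def psi(x):
    # Assumed positive feature map for the node-level kernel: elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def n2c_attn_t(Qc, Kc, q, k, v, assign, edges):
    """Sketch of tensor-product N2C attention in four steps.

    Qc, Kc : (C, d) cluster-level queries and keys
    q      : (C, d) node-level query of each cluster
    k, v   : (N, d) and (N, dv) node-level keys and values
    assign : (N,) cluster index of each node
    edges  : (E, 2) cluster pairs (target i, source j)
    """
    C, d = Qc.shape
    pk = psi(k)
    # Step 1: aggregate node-level keys and values within each cluster.
    S = np.zeros((C, d, v.shape[1]))
    z = np.zeros((C, d))
    np.add.at(S, assign, pk[:, :, None] * v[:, None, :])
    np.add.at(z, assign, pk)
    # Step 2: gate on each cluster edge from the cluster-level kernel
    # (assumed here to be an exponential dot-product kernel).
    i, j = edges[:, 0], edges[:, 1]
    gate = np.exp((Qc[i] * Kc[j]).sum(-1) / np.sqrt(d))
    # Step 3: propagate gated aggregates among clusters.
    M = np.zeros_like(S)
    n = np.zeros_like(z)
    np.add.at(M, i, gate[:, None, None] * S[j])
    np.add.at(n, i, gate[:, None] * z[j])
    # Step 4: dot product with the node-level query of each cluster,
    # normalized by the total weight (an assumed normalization).
    pq = psi(q)
    num = np.einsum("cd,cdv->cv", pq, M)
    den = np.einsum("cd,cd->c", pq, n)
    return num / den[:, None]

def n2c_attn_t_naive(Qc, Kc, q, k, v, assign, edges):
    # Reference: explicit sum over cluster edges and the nodes of each source cluster.
    C, d = Qc.shape
    out = np.zeros((C, v.shape[1]))
    den = np.zeros(C)
    for i, j in edges:
        kc = np.exp(Qc[i] @ Kc[j] / np.sqrt(d))
        for t in np.flatnonzero(assign == j):
            w = kc * (psi(q[i]) @ psi(k[t]))
            out[i] += w * v[t]
            den[i] += w
    return out / den[:, None]

# Toy graph: 7 nodes in 3 clusters, cluster edges with self-loops.
rng = np.random.default_rng(0)
assign = np.array([0, 0, 1, 1, 1, 2, 2])
edges = np.array([[0, 0], [1, 1], [2, 2], [0, 1], [1, 0], [1, 2], [2, 1]])
Qc, Kc, q = (rng.normal(size=(3, 4)) for _ in range(3))
k = rng.normal(size=(7, 4))
v = rng.normal(size=(7, 2))

out_fast = n2c_attn_t(Qc, Kc, q, k, v, assign, edges)
out_naive = n2c_attn_t_naive(Qc, Kc, q, k, v, assign, edges)
```

In this sketch the factored version touches each node once (step 1) and each cluster edge once (steps 2 and 3), whereas the naive loop revisits every node of a source cluster for each incoming edge. The example assumes every cluster has at least one incoming edge, so the normalizer is never zero.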

Theorems & Definitions (1)

  • Proposition 1