FedGT: Federated Node Classification with Scalable Graph Transformer

Zaixi Zhang; Qingyong Hu; Yang Yu; Weibo Gao; Qi Liu

TL;DR

The paper tackles node classification in subgraph federated learning under privacy constraints, where cross-subgraph links are missing and data distributions across clients are heterogeneous. It introduces FedGT, a scalable Graph Transformer that uses a hybrid local-global attention scheme with $n_s$ sampled neighbors and $n_g$ curated global nodes, achieving a linear-time complexity of $O(n(n_g+n_s))$ per forward pass. Global nodes are updated online via clustering, and client similarity for personalization is computed with optimal transport alignment of these nodes, enabling weighted, per-client aggregation; local differential privacy is applied to protect shared information. Theoretical analysis provides a bound on the approximation error of global attention, and extensive experiments on six datasets under two subgraph settings show that FedGT achieves state-of-the-art performance while effectively handling missing links and data heterogeneity, demonstrating practical impact for privacy-preserving, scalable graph learning in distributed environments.
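As a rough illustration of why the hybrid scheme is linear in the number of nodes, the sketch below lets each node attend to $n_s$ sampled neighbors plus $n_g$ global nodes, so only $n(n_s+n_g)$ attention scores are computed. It is a minimal sketch under assumptions, not the paper's implementation: the function names are hypothetical, the query/key/value projections are taken to be the identity, and the neighbor indices are drawn at random rather than by any particular sampling strategy.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_attention(H, neighbor_idx, global_nodes):
    """Hypothetical helper (identity Q/K/V projections assumed).

    H:            (n, d) node representations
    neighbor_idx: (n, n_s) indices of sampled neighbors for each node
    global_nodes: (n_g, d) global node representations
    Returns the (n, d) attended features and the (n, n_s + n_g) weights.
    """
    n, d = H.shape
    n_g = global_nodes.shape[0]
    local = H[neighbor_idx]                             # (n, n_s, d)
    glob = np.broadcast_to(global_nodes, (n, n_g, d))   # (n, n_g, d)
    keys = np.concatenate([local, glob], axis=1)        # (n, n_s + n_g, d)
    # One score per (node, key) pair: n * (n_s + n_g) scores in total.
    scores = np.einsum("nd,nkd->nk", H, keys) / np.sqrt(d)
    attn = softmax(scores, axis=1)
    out = np.einsum("nk,nkd->nd", attn, keys)
    return out, attn

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))                      # n = 6 nodes, d = 4
neighbor_idx = rng.integers(0, 6, size=(6, 3))   # n_s = 3 (random, for illustration)
global_nodes = rng.normal(size=(2, 4))           # n_g = 2
out, attn = hybrid_attention(H, neighbor_idx, global_nodes)
```

The score matrix has shape $(n, n_s+n_g)$ instead of $(n, n)$, which is where the $O(n(n_g+n_s))$ cost comes from.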

Abstract

Graphs are widely used to model relational data. As real-world graphs grow ever larger, there is a trend toward storing and computing subgraphs in multiple local systems. For example, recently proposed \emph{subgraph federated learning} methods train Graph Neural Networks (GNNs) in a distributed fashion on local subgraphs and aggregate GNN parameters with a central server. However, existing methods have the following limitations: (1) The links between local subgraphs are missing in subgraph federated learning. This can severely degrade the performance of GNNs that follow the message-passing paradigm to update node/edge features. (2) Most existing methods overlook subgraph heterogeneity, which arises because subgraphs come from different parts of the whole graph. To address these challenges, we propose a scalable \textbf{Fed}erated \textbf{G}raph \textbf{T}ransformer (\textbf{FedGT}) in this paper. First, we design a hybrid attention scheme that reduces the complexity of the Graph Transformer to linear while ensuring a global receptive field with theoretical bounds. Specifically, each node attends to sampled local neighbors and a set of curated global nodes to learn both local and global information and to be robust to missing links. The global nodes are dynamically updated during training with an online clustering algorithm to capture the data distribution of the corresponding local subgraph. Second, FedGT computes clients' similarity based on the global nodes aligned with optimal transport. The similarity is then used to perform weighted averaging for personalized aggregation, which addresses the data heterogeneity problem. Moreover, local differential privacy is applied to further protect the privacy of clients. Finally, extensive experimental results on 6 datasets and 2 subgraph settings demonstrate the superiority of FedGT.
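The personalized aggregation step can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' procedure: entropic-regularized optimal transport (Sinkhorn iterations) stands in for whatever solver the paper uses, the similarity is taken to be the negative transport cost between two clients' global nodes, and the aggregation weights are a softmax over similarities scaled by a hyperparameter `tau`. All function names and the `reg` parameter are hypothetical.

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.1, n_iter=200):
    # Entropic-regularized OT plan between two uniform distributions.
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def client_similarity(mu_a, mu_b, reg=0.1):
    # Squared Euclidean cost between every pair of global nodes.
    cost = ((mu_a[:, None, :] - mu_b[None, :, :]) ** 2).sum(-1)
    # Rescale the cost to [0, 1] before exponentiating, for stability.
    plan = sinkhorn_plan(cost / (cost.max() + 1e-12), reg)
    # Assumption: similarity = negative transport cost (higher is closer).
    return -float((plan * cost).sum())

def personalized_aggregate(params, global_nodes, tau=1.0):
    """params: list of (p,) flattened client parameters;
    global_nodes: list of (n_g, d) arrays, one per client.
    Returns one aggregated parameter vector per client and the weights."""
    k = len(params)
    S = np.array([[client_similarity(global_nodes[i], global_nodes[j])
                   for j in range(k)] for i in range(k)])
    # Row-wise softmax over scaled similarities.
    W = np.exp(tau * (S - S.max(axis=1, keepdims=True)))
    W /= W.sum(axis=1, keepdims=True)
    return W @ np.stack(params), W

rng = np.random.default_rng(0)
mu0 = rng.normal(size=(4, 3))
mu1 = mu0 + 0.01 * rng.normal(size=(4, 3))   # client 1 resembles client 0
mu2 = mu0 + 10.0                             # client 2 is far from both
params = [rng.normal(size=5) for _ in range(3)]
agg, W = personalized_aggregate(params, [mu0, mu1, mu2])
```

In this toy setup, client 0 places far more weight on client 1 than on client 2, so its personalized model is dominated by clients with a similar data distribution.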

Paper Structure

This paper contains 38 sections, 2 theorems, 17 equations, 11 figures, 12 tables, and 3 algorithms.

Introduction
Related Work
Federated Learning and Federated Graph Learning
Graph Neural Network and Graph Transformer
Preliminaries
Problem Definition
Transformer Architecture
FedGT
Scalable Graph Transformer
Personalized Aggregation
Local Differential Privacy
Theoretical Analysis of Global Attention
Experiments
Experimental Settings
Experimental Results
...and 23 more sections

Key Result

Theorem 1

Suppose the attention score function $\mathcal{A}(\cdot)$ is Lipschitz continuous with constant $\mathcal{C}$. Let $\mu \in \mathbb{R}^{n_g \times d}$ denote the representations of the global nodes, and let $P \in \mathbb{R}^{n_i \times n_g}$ be the assignment matrix used to recover $\mathbf{H}$, i.e., $\mathbf{H} \approx P\mu$. Then $\|\mathcal{A}(\mathbf{H}) - \mathcal{A}(P\mu)\|_F \leq \mathcal{C} \cdot \sigma \cdot \|\mathbf{H}\|_F$, where $\sigma \triangleq \|\mathbf{H}- P\mu\|_F/ \|\mathbf{H}\|_F$ is the approximation error rate.
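The approximation error rate $\sigma$ can be computed directly once an assignment matrix is fixed. The sketch below is a minimal illustration under assumptions: $P$ is taken to be a one-hot matrix assigning each node to its nearest global node, which is only one possible choice, and the function name is hypothetical.

```python
import numpy as np

def approximation_error_rate(H, mu):
    """sigma = ||H - P mu||_F / ||H||_F, with P assumed to be a one-hot
    nearest-global-node assignment. H: (n, d), mu: (n_g, d)."""
    d2 = ((H[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (n, n_g)
    P = np.eye(mu.shape[0])[d2.argmin(axis=1)]             # (n, n_g), one-hot rows
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    sigma = np.linalg.norm(H - P @ mu) / np.linalg.norm(H)
    return sigma, P

rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0], [-5.0, 0.0]])
# Two tight clusters of 20 nodes each around the centers.
H = np.concatenate([c + 0.1 * rng.normal(size=(20, 2)) for c in centers])

sigma_good, P = approximation_error_rate(H, centers)                     # n_g = 2
sigma_bad, _ = approximation_error_rate(H, H.mean(axis=0, keepdims=True))  # n_g = 1
```

When the global nodes sit at the cluster centers, $\sigma$ is small; collapsing them to a single global node at the mean pushes $\sigma$ close to 1, which loosens the bound accordingly.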

Figures (11)

Figure 1: The framework of FedGT. We use a case with three clients for illustration and omit the model details of Client 2 for simplicity. The node colors indicate the node labels.
Figure 2: Similarity heatmaps in the overlapping setting on Cora. (a) measures the cosine similarity of label distributions. (b) and (c) show the normalized similarity in FedGT and FedGT w/o optimal transport (OT); (d) shows the normalized cosine similarity of local model updates at round 30.
Figure 3: Hyperparameter analysis in the non-overlapping setting (10 clients) on Cora. (a), (b), (c), and (d) show the influence of the number of layers $L$, global nodes $n_g$, sampled nodes $n_s$, and scaling hyperparameter $\tau$. (e) and (f) explore the influence of $\delta$ and $\lambda$ in LDP. We apply LDP only to the uploaded global nodes (e) or both the global nodes and local model updates (f).
Figure 4: (a) An illustration of subgraph federated learning. There are three clients (subgraphs), and the color of each node indicates its label. The two main challenges are missing links and data heterogeneity. (b) The rooted tree on the local subgraph is biased due to the missing links, and the GNN is prone to making a false prediction based on the local subgraph.
Figure 5: The average testing accuracy in the non-overlapping setting with 10 clients over 100 rounds.
...and 6 more figures

Theorems & Definitions (5)

Definition 1
Theorem 1
proof
Theorem 1
proof
