HC-GST: Heterophily-aware Distribution Consistency based Graph Self-training

Fali Wang; Tianxiang Zhao; Junjie Xu; Suhang Wang

TL;DR

The paper tackles training bias in graph self-training (GST) caused by shifts in homophily ratios on heterophilic graphs. It introduces HC-GST, a framework that estimates homophily via soft labels, selects pseudo-nodes to match the global homophily distribution, assigns labels with multi-hop neighbors for heterophilic nodes, and uses a dual-head GNN to leverage all high-quality pseudo-nodes without harming the main classifier. Empirical results across homophilic and heterophilic graphs show HC-GST reduces distribution shifts, lowers bias across homophily bins, and improves self-training performance, especially at low label rates. The approach demonstrates robust gains and provides a principled way to adapt GST to heterophily, with implications for more reliable semi-supervised learning on diverse graphs.

Abstract

Graph self-training (GST), which selects unlabeled nodes and assigns them pseudo-labels, is a popular way to tackle label sparsity in graphs. However, recent studies on homophilic graphs show that GST methods can introduce and amplify distribution shift between training and test nodes, because they tend to assign pseudo-labels to the nodes they already classify well. As GNNs typically perform better on homophilic nodes, self-training may shift the training set towards homophilic pseudo-nodes, an effect that remains underexplored. Our preliminary experiments on heterophilic graphs verify that these methods can cause shifts in homophily ratio distributions, leading to a training bias that improves performance on homophilic nodes while degrading it on heterophilic ones. Therefore, we study the novel problem of reducing homophily ratio distribution shifts during self-training on heterophilic graphs. A key challenge is accurately calculating homophily ratios and their distributions without extensive labeled data. To address this challenge, we propose a novel Heterophily-aware Distribution Consistency-based Graph Self-Training (HC-GST) framework, which estimates homophily ratios using soft labels and optimizes a selection vector to align pseudo-nodes with the global homophily ratio distribution. Extensive experiments on both homophilic and heterophilic graphs show that HC-GST effectively reduces training bias and enhances self-training performance.
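The idea of estimating homophily ratios from soft labels can be illustrated with a minimal sketch. This is not the paper's estimator: the function name `soft_homophily`, the edge-list input, and the use of the inner product of two class-probability vectors as the "same class" score are all assumptions made for illustration.

```python
def soft_homophily(probs, edges):
    """Estimate a per-node homophily ratio from soft labels (illustrative only).

    probs: list of per-node class-probability lists (each sums to 1).
    edges: iterable of undirected (i, j) node-index pairs.
    Returns one value per node; isolated nodes get 0.0.
    """
    n = len(probs)
    agree = [0.0] * n   # accumulated same-class scores over neighbors
    degree = [0] * n    # number of neighbors seen per node
    for i, j in edges:
        # Assumption: the chance that i and j share a class is the inner
        # product of their predicted class distributions.
        s = sum(p * q for p, q in zip(probs[i], probs[j]))
        agree[i] += s
        agree[j] += s
        degree[i] += 1
        degree[j] += 1
    return [a / d if d else 0.0 for a, d in zip(agree, degree)]


# Toy path graph 0 - 1 - 2 with confident (one-hot) predictions.
probs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
edges = [(0, 1), (1, 2)]
h = soft_homophily(probs, edges)
```

With one-hot predictions the estimate reduces to the usual label-based ratio; with genuinely soft predictions it gives a smoothed value for nodes whose labels are unknown.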

Paper Structure (26 sections, 15 equations, 5 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Preliminary
Notations and GNNs
Pseudo-labeling and Distribution Shift
Training Bias of GST in Heterophilic Graphs
Problem Definition
Proposed Method
Heterophily-aware Distribution Consistent Pseudo-node Selection
Homophily Ratios and Distributions Estimation
Target Homophily Ratio Distribution
Pseudo-node Selection in Heterophilic Graphs
Pseudo-label Assignment with Multi-hop Neighbors
Dual-Head GNN for Fully Utilizing Pseudo-nodes
Workflow of HC-GST Framework
...and 11 more sections

Figures (5)

Figure 1: Preliminary experiments of BMGCN [he2022block] on the Chameleon graph [Rozemberczki2021]. (a) Performance across crafted training sets biased towards homophily, aligned with the true graph, and biased towards heterophily. (b) Variation in mean homophily ratio across self-training stages for the BMGCN backbone, ST, and DCGST. (c) Training biases in traditional graph self-training (ST) shown through accuracies across different homophily bins. (d) Training biases in DCGST [wang2024DCGST].
Figure 2: The heterophily-aware distribution consistency-based graph self-training framework. Red arrows indicate the loop.
Figure 3: Sensitivity analysis of hyper-parameters w.r.t. $\lambda_{\text{S}}$, $\lambda_{\text{D}}$, and $\delta_{h}$ on Chameleon with $1\%$ label rate.
Figure 4: Left: mean homophily ratio of pseudo-nodes during self-training stages on Squirrel. Right: KL divergence between local and global homophily ratio distributions.
Figure 5: Training bias across various homophily bins on Chameleon graph with $1\%$ label rate.
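The quantity plotted on the right of Figure 4, a KL divergence between local and global homophily ratio distributions, can be sketched as follows. The binning into `num_bins` equal-width bins, the `eps` smoothing, and the direction KL(local || global) are assumptions for illustration; the paper's exact choices are not stated in this summary.

```python
import math


def homophily_histogram(ratios, num_bins=5, eps=1e-8):
    """Turn homophily ratios in [0, 1] into a smoothed discrete distribution."""
    counts = [0] * num_bins
    for r in ratios:
        # A ratio of exactly 1.0 falls into the last bin.
        idx = min(int(r * num_bins), num_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    # eps keeps every bin strictly positive so the logarithm below is defined.
    return [(c + eps) / (total + eps * num_bins) for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical ratios: the pseudo-node set is skewed towards homophily.
global_ratios = [0.1, 0.2, 0.3, 0.5, 0.7, 0.9]
pseudo_ratios = [0.7, 0.8, 0.9, 0.95]
shift = kl_divergence(
    homophily_histogram(pseudo_ratios),
    homophily_histogram(global_ratios),
)
```

A value near zero means the selected pseudo-nodes follow the global homophily distribution; larger values indicate the kind of shift the figure tracks across self-training stages.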

Theorems & Definitions (3)

Definition 1: Node Homophily Ratio $h(v_i)$ [pei2020geom]
Definition 2: Graph Homophily Ratio $h(\mathcal{G})$ [pei2020geom]
Definition 3: Distribution Shift on Heterophily
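
For orientation, the node-level homophily ratios named in Definitions 1 and 2 are commonly written as below. This is the widely used formulation and may differ in notation from the paper's own statement; the symbols $\mathcal{N}(v_i)$ (neighbors of $v_i$), $y_i$ (label of $v_i$), and $\mathcal{V}$ (node set) are assumed here.

$$
h(v_i) = \frac{\left|\{\, v_j \in \mathcal{N}(v_i) : y_j = y_i \,\}\right|}{\left|\mathcal{N}(v_i)\right|},
\qquad
h(\mathcal{G}) = \frac{1}{|\mathcal{V}|} \sum_{v_i \in \mathcal{V}} h(v_i)
$$

For example, a node with four neighbors, three of which share its label, has $h(v_i) = 3/4$.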
