HC-GST: Heterophily-aware Distribution Consistency based Graph Self-training
Fali Wang, Tianxiang Zhao, Junjie Xu, Suhang Wang
TL;DR
The paper tackles training bias in graph self-training (GST) caused by shifts in homophily ratios on heterophilic graphs. It introduces HC-GST, a framework that estimates homophily via soft labels, selects pseudo-nodes to match the global homophily distribution, assigns labels with multi-hop neighbors for heterophilic nodes, and uses a dual-head GNN to leverage all high-quality pseudo-nodes without harming the main classifier. Empirical results across homophilic and heterophilic graphs show HC-GST reduces distribution shifts, lowers bias across homophily bins, and improves self-training performance, especially at low label rates. The approach demonstrates robust gains and provides a principled way to adapt GST to heterophily, with implications for more reliable semi-supervised learning on diverse graphs.
Abstract
Graph self-training (GST), which selects unlabeled nodes and assigns them pseudo-labels, is a popular way to tackle label sparsity in graphs. However, recent studies on homophilic graphs show that GST methods can introduce and amplify distribution shift between training and test nodes, because they tend to assign pseudo-labels to nodes on which they already perform well. Since GNNs typically perform better on homophilic nodes, self-training may shift the training set towards homophilic pseudo-nodes, an effect that remains underexplored. Our preliminary experiments on heterophilic graphs verify that these methods can cause shifts in homophily ratio distributions, leading to training bias that improves performance on homophilic nodes while degrading it on heterophilic ones. Therefore, we study a novel problem of reducing homophily ratio distribution shifts during self-training on heterophilic graphs. A key challenge is accurately calculating homophily ratios and their distributions without extensive labeled data. To address this challenge, we propose a novel Heterophily-aware Distribution Consistency-based Graph Self-Training (HC-GST) framework, which estimates homophily ratios using soft labels and optimizes a selection vector to align pseudo-nodes with the global homophily ratio distribution. Extensive experiments on both homophilic and heterophilic graphs show that HC-GST effectively reduces training bias and enhances self-training performance.
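To make the idea of estimating homophily ratios from soft labels concrete, the following is a minimal sketch, not the paper's implementation. It assumes that a node's soft homophily ratio is the average, over its neighbors, of the probability that the two endpoints share a class (the dot product of their predicted class distributions); the abstract does not give the exact formula. The function names `soft_homophily` and `homophily_histogram` and the bin count are hypothetical, introduced only for illustration.

```python
import numpy as np


def soft_homophily(edges, probs):
    """Per-node soft homophily ratio (assumed definition, not from the paper).

    edges: iterable of undirected (u, v) node-index pairs.
    probs: (num_nodes, num_classes) array of predicted class probabilities.
    Returns an array where entry i is the mean of probs[i] . probs[j]
    over the neighbors j of node i (0.0 for isolated nodes).
    """
    n = probs.shape[0]
    sums = np.zeros(n)
    deg = np.zeros(n)
    for u, v in edges:
        # Probability that u and v share a label if labels were drawn
        # independently from their soft predictions.
        agree = float(probs[u] @ probs[v])
        sums[u] += agree
        sums[v] += agree
        deg[u] += 1
        deg[v] += 1
    return np.divide(sums, deg, out=np.zeros(n), where=deg > 0)


def homophily_histogram(h, bins=5):
    """Normalized histogram of homophily ratios over equal-width bins in [0, 1]."""
    counts, _ = np.histogram(h, bins=bins, range=(0.0, 1.0))
    return counts / max(counts.sum(), 1)
```

Under these assumptions, comparing `homophily_histogram` over all nodes with the same histogram over the selected pseudo-nodes gives a rough view of the homophily ratio distribution shift that the selection step aims to reduce.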
