Mitigating Label Noise on Graph via Topological Sample Selection
Yuhao Wu, Jiangchao Yao, Xiaobo Xia, Jun Yu, Ruxin Wang, Bo Han, Tongliang Liu
TL;DR
The paper addresses label noise in graph-structured data by introducing Topological Sample Selection (TSS), a graph curriculum learning framework guided by Class-conditional Betweenness Centrality (CBC). CBC uses topological information, obtained via Personalized PageRank, to identify informative nodes that lie near class boundaries. This enables an easy-to-hard sampling schedule that starts from likely-clean samples far from the boundaries and progressively includes harder, boundary-adjacent ones. The authors prove that the procedure minimizes an upper bound on the expected risk under the clean distribution, and show empirically that TSS outperforms state-of-the-art baselines on both small and large graphs across several noise models. The results indicate that topology-aware sample selection makes GNNs more robust to noisy labels, which is relevant to real-world graph learning tasks where labeling is imperfect.
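The selection idea summarized above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation, and it does not reproduce the paper's CBC formula, which this summary does not state. The function names (`personalized_pagerank`, `boundary_proxy_scores`, `curriculum_subset`), the proxy score (pairwise products of per-class Personalized PageRank shares), the linear pacing schedule, and all parameter values are assumptions chosen for illustration only.

```python
import math
from itertools import combinations


def personalized_pagerank(adj, seeds, alpha=0.15, iters=100):
    """Power iteration for a random walk that restarts at `seeds` with prob. alpha.

    `adj` is an adjacency list (list of neighbour lists) of an undirected graph.
    """
    if not seeds:
        raise ValueError("seed set must be non-empty")
    n = len(adj)
    restart = [1.0 / len(seeds) if v in seeds else 0.0 for v in range(n)]
    p = restart[:]
    for _ in range(iters):
        nxt = [alpha * r for r in restart]
        for u in range(n):
            if adj[u]:
                share = (1.0 - alpha) * p[u] / len(adj[u])
                for v in adj[u]:
                    nxt[v] += share
            else:
                # Isolated node: send its walking mass back to the seeds.
                for v in range(n):
                    nxt[v] += (1.0 - alpha) * p[u] * restart[v]
        p = nxt
    return p


def boundary_proxy_scores(adj, labels, num_classes):
    """Hypothetical stand-in for CBC: high when several classes reach a node.

    For each node, the per-class PageRank masses are normalised to shares and
    the products of shares are summed over all class pairs.
    """
    n = len(adj)
    ppr = [
        personalized_pagerank(adj, {v for v in range(n) if labels[v] == c})
        for c in range(num_classes)
    ]
    scores = []
    for v in range(n):
        total = sum(ppr[c][v] for c in range(num_classes))
        if total == 0.0:
            scores.append(0.0)  # unreachable from every class
            continue
        share = [ppr[c][v] / total for c in range(num_classes)]
        scores.append(
            sum(share[a] * share[b] for a, b in combinations(range(num_classes), 2))
        )
    return scores


def curriculum_subset(scores, epoch, total_epochs, start_frac=0.25):
    """Easy-to-hard selection: keep the lowest-scoring fraction of nodes.

    The fraction grows linearly from `start_frac` at epoch 0 to 1.0 at the end.
    """
    frac = min(1.0, start_frac + (1.0 - start_frac) * epoch / total_epochs)
    k = max(1, math.ceil(frac * len(scores)))
    order = sorted(range(len(scores)), key=lambda v: scores[v])
    return order[:k]


if __name__ == "__main__":
    # Path graph 0-1-2-3-4-5; the (possibly noisy) labels split it in the middle.
    adj = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
    labels = [0, 0, 0, 1, 1, 1]
    scores = boundary_proxy_scores(adj, labels, num_classes=2)
    for epoch in (0, 5, 10):
        print(epoch, sorted(curriculum_subset(scores, epoch, total_epochs=10)))
```

On the six-node path graph in the demo, the two nodes adjacent to the class boundary receive the highest proxy scores, so they enter the training subset last. In a real pipeline the selected subset would be fed to a GNN at each epoch; that step is omitted here.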
Abstract
Despite their success on carefully annotated benchmarks, existing graph neural networks (GNNs) can be considerably impaired in practice when real-world graph data is noisily labeled. Sample selection has been shown to be an effective approach to robust learning with noisy labels. However, conventional studies focus on i.i.d. data, and when moving to non-i.i.d. graph data and GNNs, two notable challenges remain: (1) nodes located near topological class boundaries are very informative for classification but cannot be successfully distinguished by heuristic sample selection; (2) there is no available measure that considers graph topological information to promote sample selection in a graph. To address this dilemma, we propose a $\textit{Topological Sample Selection}$ (TSS) method that boosts the informative sample selection process in a graph by utilizing topological information. We theoretically prove that our procedure minimizes an upper bound of the expected risk under the target clean distribution, and experimentally show the superiority of our method compared with state-of-the-art baselines.
