Mitigating Label Noise on Graph via Topological Sample Selection

Yuhao Wu, Jiangchao Yao, Xiaobo Xia, Jun Yu, Ruxin Wang, Bo Han, Tongliang Liu

TL;DR

The paper addresses the challenge of label noise in graph-structured data by introducing Topological Sample Selection (TSS), a graph curriculum learning framework guided by Class-conditional Betweenness Centrality (CBC). CBC leverages topological information via Personalized PageRank to identify boundary-near, informative nodes, enabling an easy-to-hard sampling schedule that starts from likely-clean, far-from-boundary samples and progressively includes harder, boundary-adjacent ones. The authors provide a theoretical upper-bound guarantee on the expected risk under the clean distribution and demonstrate empirically that TSS outperforms state-of-the-art baselines on both small and large graphs across several noise models. The results indicate that topology-aware sample selection yields robust GNN performance in noisy-label scenarios and has potential implications for real-world graph learning tasks where labeling is imperfect.
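The easy-to-hard idea in the summary above can be illustrated with a small sketch. This is a minimal illustration under stated assumptions, not the authors' implementation. The summary does not give the CBC formula, so `boundary_score` is a hypothetical stand-in: it takes the entropy of the Personalized PageRank mass per (possibly noisy) class label, which is high when a random walk from a node reaches several classes. The linear pacing function and the reading of `lam0` as the initial fraction of selected nodes are likewise assumptions.

```python
import math

def personalized_pagerank(adj, source, alpha=0.15, iters=100):
    """Personalized PageRank by power iteration.

    adj maps each node to a list of neighbours; alpha is the probability
    of restarting the walk at `source`.
    """
    pi = {v: 0.0 for v in adj}
    pi[source] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        nxt[source] += alpha  # restart mass
        for u, nbrs in adj.items():
            if not nbrs:  # isolated node: return its mass to the source
                nxt[source] += (1 - alpha) * pi[u]
                continue
            share = (1 - alpha) * pi[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        pi = nxt
    return pi

def boundary_score(adj, labels, node, alpha=0.15):
    """Hypothetical stand-in for CBC: entropy of PPR mass per class label."""
    mass = {}
    for v, p in personalized_pagerank(adj, node, alpha).items():
        mass[labels[v]] = mass.get(labels[v], 0.0) + p
    return -sum(p * math.log(p) for p in mass.values() if p > 0)

def linear_pace(epoch, total_epochs, lam0=0.25):
    """Assumed linear pacing: fraction of training nodes used at `epoch`."""
    return min(1.0, lam0 + (1.0 - lam0) * epoch / total_epochs)

def select_nodes(scores, epoch, total_epochs, lam0=0.25):
    """Keep the lowest-scoring (easiest) nodes allowed by the pacing function."""
    ranked = sorted(scores, key=scores.get)
    k = max(1, int(round(linear_pace(epoch, total_epochs, lam0) * len(ranked))))
    return ranked[:k]

# Toy graph: two triangles (classes "A" and "B") joined by the edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
scores = {v: boundary_score(adj, labels, v) for v in adj}
early = select_nodes(scores, epoch=0, total_epochs=10, lam0=0.5)
late = select_nodes(scores, epoch=10, total_epochs=10, lam0=0.5)
```

On this toy graph the bridge nodes 2 and 3 receive the highest scores, so `early` excludes them and `late` contains every node. In a real pipeline the selected set would be passed to GNN training at each epoch.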

Abstract

Despite the success of carefully annotated benchmarks, the effectiveness of existing graph neural networks (GNNs) can be considerably impaired in practice when real-world graph data is noisily labeled. Sample selection has been shown to be an effective approach to robust learning with noisy labels. However, conventional studies focus on i.i.d. data, and when moving to non-i.i.d. graph data and GNNs, two notable challenges remain: (1) nodes located near topological class boundaries are very informative for classification but cannot be successfully distinguished by heuristic sample selection; (2) there is no available measure that considers graph topological information to promote sample selection in a graph. To address this dilemma, we propose a Topological Sample Selection (TSS) method that boosts the informative sample selection process in a graph by utilising topological information. We theoretically prove that our procedure minimizes an upper bound of the expected risk under the target clean distribution, and experimentally show the superiority of our method compared with state-of-the-art baselines.

Paper Structure (48 sections, 1 theorem, 31 equations, 11 figures, 7 tables, 1 algorithm)

Key Result

Theorem 1

Suppose $\{(Z_{\mathbf{x}_i},y_i)\}^{m}_{i=1}$ are i.i.d. samples drawn from the pace distribution $Q_{\lambda}$ with radius $|X| \leq R$. Let $m_{+}$ and $m_{-}$ denote the numbers of positive and negative samples, and let $m^{*} = \min\{m_{-},m_{+}\}$. Let $\mathcal{H} = \{\mathbf{x} \rightarrow \mathbf{w}^{T}\,\ldots$ … where $E^{+}$, $E^{-}$ denote error distributions that capture the deviation from $\mathbbm{P}^{+}$ …

Figures (11)

  • Figure 1: Illustration of noisily labeled nodes with different topological structures. $v_{1}$ is a mislabeled node located near class boundaries while $v_{15}$ is a mislabeled node far away from class boundaries.
  • Figure 2: Robustness of Class-conditional Betweenness Centrality (t-SNE visualization of node embeddings based on trained GNNs from the CORA dataset). (a) Clean labeled nodes with lower CBC (lighter colour) are farther away from class boundaries than those with high CBC (darker colour). (b)(c)(d) Compared with two other difficulty measurers [wei2023clnode, li2023curriculum] in graph curriculum learning under 40% symmetric label noise, CBC differentiates boundary-near nodes more clearly.
  • Figure 3: Correlation between the F-score of extracting confident nodes and the overall CBC of the noisily labeled subsets in a graph with 30% symmetric label noise. The Pearson coefficient is $-0.9276$ on 50 randomly selected subsets, with a $p$-value smaller than $0.0001$.
  • Figure 4: The distributions of the CBC score w.r.t. nodes on WikiCS with $40\%$ and $60\%$ symmetric noise (symm.) or $40\%$ and $60\%$ instance-based noise (inst.). Nodes are considered "far from topological class boundaries" (far from boundaries.) when their two-hop neighbours belong to the same class; conversely, nodes are categorized as "near topological class boundaries" (near boundaries.) when this condition does not hold. More comprehensive experiments are given in the Appendix.
  • Figure 5: The hyperparameter analysis of TSS. Results are reported over five trials under 20% symmetric noise. (a) The test accuracy of TSS with three different pacing functions on various datasets. (b) The test accuracy of TSS with increasing $\lambda_{0}$ on CORA and PubMed.
  • ...and 6 more figures
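
The "far from" versus "near" boundary rule stated in the Figure 4 caption can be sketched as follows. This is a minimal sketch under assumptions rather than the authors' code: it takes "two-hop neighbours" to mean every node within two hops, compares them against the node's own class, and uses the hypothetical helper names `two_hop_neighbours` and `is_far_from_boundary`.

```python
def two_hop_neighbours(adj, node):
    """All nodes within two hops of `node`, excluding the node itself."""
    one_hop = set(adj[node])
    reached = set(one_hop)
    for u in one_hop:
        reached.update(adj[u])
    reached.discard(node)
    return reached

def is_far_from_boundary(adj, labels, node):
    """True when every node within two hops shares the node's class.

    Assumption: an isolated node has no neighbours to disagree with,
    so it counts as far from the boundary.
    """
    return all(labels[v] == labels[node] for v in two_hop_neighbours(adj, node))

# Toy path graph 0-1-2-3-4-5 with the class changing between nodes 2 and 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
far = [v for v in adj if is_far_from_boundary(adj, labels, v)]
near = [v for v in adj if v not in far]
```

Here only the two end nodes are far from the boundary, since every other node reaches the opposite class within two hops.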

Theorems & Definitions (8)

  • Definition 2.1: Class-conditional Betweenness Centrality
  • Definition 2.2: Topological Sample Selection
  • Theorem 1
  • Definition 1.1: Betweenness centrality
  • Definition 1.2: Class-conditional Betweenness Centrality
  • Definition 3.1
  • proof
  • proof