UPL: Uncertainty-aware Pseudo-labeling for Imbalance Transductive Node Classification

Mohammad T. Teimuri, Zahra Dehghanian, Gholamali Aminian, Hamid R. Rabiee

TL;DR

This work tackles class imbalance in transductive node classification on graphs by deriving a population-risk upper bound that highlights the minority class as the primary driver of error and by proposing Uncertainty-aware Pseudo-labeling (UPL). UPL combines uncertainty estimation and thresholded pseudo-labeling with a minority-focused augmentation strategy and a balanced Softmax loss to improve minority-class performance while preserving graph structure. Theoretical guarantees via transductive Rademacher complexity are complemented by extensive empirical results showing state-of-the-art performance across both homophilic and heterophilic graphs, with reduced performance variance. The approach offers a practical, scalable path to robust imbalanced graph learning and opens avenues for extensions to multi-class, inductive settings, and heterophily-aware designs.
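
The balanced Softmax loss mentioned above is commonly implemented by adding the log class prior to the logits before the usual cross-entropy. The following is a minimal sketch of that general technique, not the authors' code; the function names and the toy class counts are assumptions made for illustration.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def balanced_softmax_loss(logits, labels, class_counts):
    """Mean cross-entropy on logits shifted by the log class prior.

    class_counts holds the number of labeled training nodes per class
    (hypothetical input; the summary does not specify how priors are set).
    """
    total = sum(class_counts)
    log_prior = [math.log(c / total) for c in class_counts]
    losses = []
    for row, y in zip(logits, labels):
        shifted = [z + lp for z, lp in zip(row, log_prior)]
        losses.append(-log_softmax(shifted)[y])
    return sum(losses) / len(losses)

# Two nodes with identical, uninformative logits.
logits = [[0.0, 0.0], [0.0, 0.0]]
labels = [0, 1]
# Uniform counts: the shift cancels and this is plain cross-entropy.
print(balanced_softmax_loss(logits, labels, [1, 1]))
# Imbalanced counts: an uninformative prediction on the minority node costs more.
print(balanced_softmax_loss(logits, labels, [90, 10]))
```

With the prior shift in place, a model has to assign a larger raw logit to a minority class to reach the same training probability, which counteracts the bias toward the majority class.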

Abstract

Graph-structured datasets often suffer from class imbalance, which complicates node classification tasks. In this work, we address this issue by first providing an upper bound on population risk for imbalanced transductive node classification. We then propose a simple and novel algorithm, Uncertainty-aware Pseudo-labeling (UPL). Our approach leverages pseudo-labels assigned to unlabeled nodes to mitigate the adverse effects of imbalance on classification accuracy. Furthermore, the UPL algorithm enhances the accuracy of pseudo-labeling by reducing training noise of pseudo-labels through a novel uncertainty-aware approach. We comprehensively evaluate the UPL algorithm across various benchmark datasets, demonstrating its superior performance compared to existing state-of-the-art methods.
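
As a rough illustration of uncertainty-aware pseudo-labeling, the sketch below averages class probabilities over several stochastic forward passes, scores each unlabeled node by normalized predictive entropy, and keeps only nodes below a cutoff. The uncertainty measure, the single `max_uncertainty` cutoff, and all function names are assumptions; this page does not state the exact estimator or threshold rule UPL uses.

```python
import math

def mean_probs(passes):
    """Average class probabilities over stochastic forward passes for one node."""
    n = len(passes)
    return [sum(p[c] for p in passes) / n for c in range(len(passes[0]))]

def normalized_entropy(probs):
    """Entropy scaled to [0, 1]: 0 is fully confident, 1 is uniform."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def select_pseudo_labels(node_passes, max_uncertainty):
    """Return {node_id: predicted_class} for nodes with low enough uncertainty.

    max_uncertainty is a hypothetical cutoff, not a parameter from the paper.
    """
    selected = {}
    for node_id, passes in node_passes.items():
        probs = mean_probs(passes)
        if normalized_entropy(probs) <= max_uncertainty:
            selected[node_id] = max(range(len(probs)), key=probs.__getitem__)
    return selected

# Toy predictions: two passes per unlabeled node, two classes.
node_passes = {
    "a": [[0.98, 0.02], [0.96, 0.04]],  # consistent and confident
    "b": [[0.90, 0.10], [0.20, 0.80]],  # passes disagree, high uncertainty
    "c": [[0.05, 0.95], [0.01, 0.99]],  # consistent and confident
}
print(select_pseudo_labels(node_passes, max_uncertainty=0.3))
```

Filtering on uncertainty rather than on the top probability of a single pass is one way to keep noisy pseudo-labels out of the training set, which is the effect the abstract attributes to UPL.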

Paper Structure

This paper contains 33 sections, 4 theorems, 28 equations, 3 figures, 18 tables, 1 algorithm.

Key Result

Proposition 4.4

Let $Q_i=(\frac{1}{u_i}+\frac{1}{m_i})$, $S_i=\frac{m_i+u_i}{(m_i+u_i-1/2)(1-1/(2\max(m_i,u_i)))}$ for $i=1,2$. Then, with probability at least $(1-\delta)$ over the choice of the training set from nodes of graphs, for all $h_\theta\in\mathcal{H}$, where $\mathfrak{R}_{m_i+u_i}(\mathcal{H})$ is the transductive Rademacher complexity for $i$-th class, $R_{\gamma}(\mathbf{Z}_{m_i},h_{\theta})$ is t
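
To get a feel for the two quantities defined in the proposition, the sketch below evaluates $Q_i$ and $S_i$ for a toy majority and minority class. It assumes $m_i$ and $u_i$ are the numbers of labeled and unlabeled nodes of class $i$; the counts are made up, and how $Q_i$ and $S_i$ enter the bound is not shown in this excerpt.

```python
def q_term(m, u):
    """Q_i = 1/u_i + 1/m_i, as defined in the proposition."""
    return 1.0 / u + 1.0 / m

def s_term(m, u):
    """S_i = (m_i + u_i) / ((m_i + u_i - 1/2) * (1 - 1/(2 * max(m_i, u_i))))."""
    return (m + u) / ((m + u - 0.5) * (1.0 - 1.0 / (2.0 * max(m, u))))

# Hypothetical class sizes: (labeled, unlabeled) nodes per class.
classes = {"majority": (200, 2000), "minority": (20, 200)}
for name, (m, u) in classes.items():
    print(name, "Q =", q_term(m, u), "S =", s_term(m, u))
```

In this toy setting $S_i$ stays close to 1 for both classes, while $Q_i$ is ten times larger for the minority class, so any term that scales with $Q_i$ grows as a class gets smaller.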

Figures (3)

  • Figure 1: Pipeline of the UPL Algorithm
  • Figure 2: F1-score for different values of the upper and lower pseudo-labeling thresholds. Left: sweep over $\eta_l$ with $\eta_u=0.3+\eta_l$ on the CiteSeer dataset. Right: sweep over $\eta_u$ from $0.3$ to $1.0$ on Cora. In both plots, the red line shows validation results and the blue line shows test results.
  • Figure 3: Selection Edge Removal: F1 score versus number of iterations and number of edges for removal. Each plot is normalized to its lowest value.

Theorems & Definitions (8)

  • Proposition 4.4
  • Proposition 4.5
  • Theorem 4.6
  • Lemma 3.1: Bound on infinite norm of the symmetric normalized graph filter
  • proof
  • proof
  • proof
  • proof