BANGS: Game-Theoretic Node Selection for Graph Self-Training

Fangxin Wang, Kay Liu, Sourav Medya, Philip S. Yu

TL;DR

Unlike traditional methods that rank and select nodes independently, BANGS treats nodes as a collective set during self-training. It outperforms existing techniques and remains robust across various datasets, base models, and hyperparameter settings.

Abstract

Graph self-training is a semi-supervised learning method that iteratively selects a set of unlabeled data to retrain the underlying graph neural network (GNN) model and improve its prediction performance. While selecting highly confident nodes has proven effective for self-training, this pseudo-labeling strategy ignores the combinatorial dependencies between nodes and suffers from a local view of the distribution. To overcome these issues, we propose BANGS, a novel framework that unifies the labeling strategy with conditional mutual information as the objective of node selection. Our approach, grounded in game theory, selects nodes in a combinatorial fashion and provides theoretical guarantees for robustness under a noisy objective. More specifically, unlike traditional methods that rank and select nodes independently, BANGS considers nodes as a collective set in the self-training process. Our method demonstrates superior performance and robustness across various datasets, base models, and hyperparameter settings, outperforming existing techniques. The codebase is available at https://github.com/fangxin-wang/BANGS.
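For contrast with the set-based selection that BANGS proposes, the confidence-based baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration, not code from the BANGS repository: the function name `select_confident_nodes`, the array `probs` (one row of class probabilities per node), and the toy numbers are all assumptions made for the example.

```python
import numpy as np

def select_confident_nodes(probs, unlabeled_idx, k):
    """Return (node ids, pseudo-labels) for the k most confident unlabeled nodes.

    Hypothetical helper: each node is scored on its own, which is the
    "rank and select independently" strategy the abstract criticizes.
    """
    probs = np.asarray(probs, dtype=float)
    unlabeled_idx = np.asarray(unlabeled_idx)
    # Confidence of a node = probability of its most likely class.
    confidence = probs[unlabeled_idx].max(axis=1)
    # Sort by descending confidence; no interaction between nodes is considered.
    order = np.argsort(-confidence, kind="stable")[:k]
    chosen = unlabeled_idx[order]
    return chosen, probs[chosen].argmax(axis=1)

# Toy teacher predictions for four unlabeled nodes and two classes (made up).
probs = np.array([
    [0.90, 0.10],
    [0.55, 0.45],
    [0.20, 0.80],
    [0.50, 0.50],
])
nodes, labels = select_confident_nodes(probs, unlabeled_idx=[0, 1, 2, 3], k=2)
print(nodes, labels)
```

In a full self-training loop, the selected nodes and their pseudo-labels would be added to the training set and the model retrained; that part is omitted here because it depends on the base GNN.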

Paper Structure

This paper contains 33 sections, 5 theorems, 55 equations, 4 figures, 8 tables, 1 algorithm.

Key Result

Lemma 2.1

Maximizing the mutual information between the unlabeled data distribution and its prediction distribution is roughly equivalent to simultaneously maximizing the entropy over the unlabeled dataset and minimizing the sum of the individual prediction entropies over all unlabeled nodes.
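The two terms in this lemma mirror the standard decomposition of mutual information into a marginal entropy minus a conditional entropy. The sketch below uses assumed notation that does not come from the paper: $X$ is a node drawn uniformly from the unlabeled set $\mathcal{U}$, $\hat{Y}$ is its predicted label, and $p_u$ is the model's predictive distribution for node $u$. The paper's exact statement and notation may differ.

```latex
\begin{align}
I(X; \hat{Y})
  &= H(\hat{Y}) - H(\hat{Y} \mid X) \\
  &= H\!\left( \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} p_u \right)
     - \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} H(p_u)
\end{align}
```

Under these assumptions, increasing the first term spreads the averaged predictions across classes, while decreasing the second term makes each individual prediction more certain.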

Figures (4)

  • Figure 1: The workflow of BANGS. We utilize the teacher model's predictions to propagate features from both labeled nodes and the candidate set ${\mathbb{S}}$, estimating the logits of unlabeled nodes and the Banzhaf value of set ${\mathbb{S}}$ (see the paper's Banzhaf value equation). For simplicity, only the first step of feature propagation is shown. Using Banzhaf values, we rank the individual contributions of unlabeled nodes and add the top $k$ to the pseudo-label set. The student model is subsequently retrained using this updated set.
  • Figure 2: Hyperparameter and robustness analysis on the Cora dataset. Our method remains valid and superior to the baselines under different settings and hyperparameters.
  • Figure 3: Pipeline of BANGS.
  • Figure 4: Test accuracy on PubMed with different node selection criteria over 40 self-training rounds. In each round, 100 nodes are selected for pseudo-labeling. Our method with confidence calibration outperforms the others.
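The ranking step described in Figure 1 relies on Banzhaf values. As a rough illustration, the sketch below computes the standard (unbounded) Banzhaf value by exact enumeration and then keeps the top $k$ nodes. It is not the paper's k-Bounded Banzhaf value (Definition 3.3) and does not use feature propagation: the `coverage` utility, the node names, and the helper names `banzhaf_values` and `top_k` are hypothetical and chosen only to keep the example small.

```python
from itertools import combinations

def banzhaf_values(players, utility):
    """Exact Banzhaf value of each player.

    For player i this is the average of utility(S | {i}) - utility(S)
    over all subsets S of the remaining players.
    """
    players = list(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += utility(s | {i}) - utility(s)
        values[i] = total / 2 ** len(others)
    return values

def top_k(values, k):
    """Return the k players with the largest values (ties broken by name)."""
    return sorted(values, key=lambda n: (-values[n], n))[:k]

# Hypothetical toy utility: each candidate node "covers" some unlabeled nodes,
# and the utility of a set is the number of distinct nodes it covers.
coverage = {"a": {1, 2, 3}, "b": {1, 2, 3}, "c": {4, 5}, "d": {6}}

def utility(subset):
    covered = set()
    for node in subset:
        covered |= coverage[node]
    return float(len(covered))

values = banzhaf_values(coverage, utility)
selected = top_k(values, k=2)
print(values, selected)
```

In this toy setting, nodes `a` and `b` each look best in isolation (each covers three nodes), but they are redundant with each other, so their Banzhaf values drop below that of `c`. Exact enumeration costs on the order of $2^n$ utility evaluations, so it is only practical for very small candidate sets.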

Theorems & Definitions (12)

  • Lemma 2.1
  • Definition 3.1: Feature influence distribution
  • Theorem 3.1: Feature influence computation via random walk
  • Definition 3.2: Output feature estimation with propagation
  • Definition 3.3: k-Bounded Banzhaf value
  • Theorem 3.2
  • proof
  • Lemma B.1
  • proof
  • Lemma B.2
  • ...and 2 more