Virtual Node Generation for Node Classification in Sparsely-Labeled Graphs

Hang Cui, Tarek Abdelzaher

TL;DR

A novel node generation method that infuses a small set of high-quality synthesized nodes into the graph as additional labeled nodes to optimally expand the propagation of labeled information.

Abstract

In the broader machine learning literature, data-generation methods show promising results by augmenting sparse labels with additional informative training examples. Such methods are less studied in graphs because of the intricate dependencies among nodes in complex topological structures. This paper presents a novel node generation method that infuses a small set of high-quality synthesized nodes into the graph as additional labeled nodes to optimally expand the propagation of labeled information. Because it only adds nodes, the framework is orthogonal to the graph learning and downstream classification techniques, and is thus compatible with most popular graph pre-training (self-supervised learning), semi-supervised learning, and meta-learning methods. The contribution lies in designing the generated node set by solving a novel optimization problem. The optimization places the generated nodes so as to (1) minimize the classification loss to guarantee training accuracy and (2) maximize label propagation to low-confidence nodes in the downstream task to ensure high-quality propagation. Theoretically, we show that this dual optimization maximizes the global confidence of node classification. Our experiments demonstrate statistically significant performance improvements over 14 baselines on 10 publicly available datasets.

Paper Structure (22 sections, 2 theorems, 21 equations, 2 figures, 7 tables, 1 algorithm)


Key Result

Proposition

Given a pre-trained link prediction model $\mathrm{Connect}(v_i, v_j)$ and embedding matrix $H$, assume that the number of generated nodes is small, s.t. $|V_s| \ll |V|$, and that the edges of generated nodes are created following the Erdős–Rényi (ER) model with probability proportional to $\mathrm{Connect}(\tilde{v}, v_i)$. Then the expected […] where $d_i$ is the degree of node $i$ and $X_{\tilde{v}}$ is the feature of the generated node.
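
The edge-creation assumption in the proposition can be illustrated with a minimal sketch: each candidate edge from a generated node $\tilde{v}$ to an existing node $v_i$ is drawn independently, ER-style, with probability proportional to a link-prediction score. The function name `sample_virtual_edges`, the `scores` list standing in for $\mathrm{Connect}(\tilde{v}, v_i)$, and the `expected_edges` budget are hypothetical choices made for illustration and are not taken from the paper.

```python
import random

def sample_virtual_edges(scores, expected_edges, rng):
    """Sample the neighbors of one generated node, one independent coin flip per node.

    scores: non-negative floats; scores[i] stands in for Connect(v_tilde, v_i)
        (an assumption, the real scores would come from a link predictor).
    expected_edges: hypothetical budget; probabilities are scaled so that they
        sum to this value before being capped at 1.
    rng: a random.Random instance, passed in for reproducibility.
    Returns the indices of existing nodes that receive an edge.
    """
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    total = sum(scores)
    if total == 0:
        return []  # no node is a plausible neighbor
    # Edge probability is proportional to the score, capped at 1.
    probs = [min(1.0, expected_edges * s / total) for s in scores]
    # rng.random() lies in [0, 1), so a zero-probability edge is never drawn.
    return [i for i, p in enumerate(probs) if rng.random() < p]

# Toy scores for five existing nodes; node 3 has score 0.
scores = [0.9, 0.1, 0.5, 0.0, 0.5]
neighbors = sample_virtual_edges(scores, expected_edges=2.0, rng=random.Random(0))
```

With these toy scores the edge probabilities equal the scores themselves, because the scores sum to the budget of 2.0. Capping at 1 means the realized expected degree can fall below the budget when a single score dominates.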

Figures (2)

  • Figure 1: Node generation framework: bridging GNN propagation and node augmentation. $G$ represents the original graph, where the red circle denotes the local neighborhood of the sparsely labeled node (red) and the green circle denotes low-score regions in the graph. $G'$ is the graph with two generated nodes (orange), obtained by augmenting the labeled node (red).
  • Figure 2: Ablation study on hyperparameters

Theorems & Definitions (4)

  • Proposition
  • Proposition
  • Proof
  • Proof