Robust Offline Active Learning on Graphs

Yuanchen Wu, Yubai Yuan

TL;DR

This paper proposes an offline active learning method that selects nodes to query by explicitly incorporating information from both the network structure and node covariates, and establishes a theoretical relationship between generalization error and the number of nodes selected by the proposed method.

Abstract

We consider the problem of active learning on graphs, which has crucial applications in many real-world networks where labeling node responses is expensive. In this paper, we propose an offline active learning method that selects nodes to query by explicitly incorporating information from both the network structure and node covariates. Building on graph signal recovery theory and the random spectral sparsification technique, the proposed method adopts a two-stage biased sampling strategy that takes both informativeness and representativeness into consideration for node querying. Informativeness refers to the complexity of graph signals that are learnable from the responses of queried nodes, while representativeness refers to the capacity of queried nodes to control generalization errors given noisy node-level information. We establish a theoretical relationship between generalization error and the number of nodes selected by the proposed method. Our theoretical results demonstrate the trade-off between informativeness and representativeness in active learning. Extensive numerical experiments show that the proposed method is competitive with existing graph-based active learning methods, especially when node covariates and responses contain noise. Additionally, the proposed method is applicable to both regression and classification tasks on graphs.

Paper Structure (24 sections, 4 theorems, 100 equations, 3 figures, 4 tables, 1 algorithm)

Key Result

Theorem 3.1

Any graph signal $\mathbf{f} \in \mathbf{H}_{\omega}(\mathbf{X}, \mathbf{A})$ can be identified using labels on a subset of nodes $\mathcal{S}$ if and only if: where $\mathcal{S}^c$ denotes the complement of $\mathcal{S}$ in $\mathbf{V}$.
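
The general idea of identifying a graph signal from labels on a node subset can be illustrated numerically. The sketch below is not the paper's construction: it assumes a simplified signal space spanned by the first `K` Laplacian eigenvectors of a path graph, ignores the covariates $\mathbf{X}$, and uses noiseless labels. The names `path_graph_laplacian`, `recover_bandlimited`, `U_K`, and the choice of `S` are hypothetical. In this simplified setting, a signal is identifiable from $\mathcal{S}$ exactly when the rows of `U_K` indexed by $\mathcal{S}$ have full column rank.

```python
import numpy as np

def path_graph_laplacian(n):
    """Combinatorial Laplacian L = D - A of a path graph on n nodes."""
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    return np.diag(A.sum(axis=1)) - A

def recover_bandlimited(U_K, S, y_S):
    """Least-squares fit of low-frequency coefficients from labels on S."""
    coef, *_ = np.linalg.lstsq(U_K[S], y_S, rcond=None)
    return U_K @ coef

n, K = 30, 3                       # illustrative sizes (assumptions)
L = path_graph_laplacian(n)
_, U = np.linalg.eigh(L)           # eigenvalues returned in ascending order
U_K = U[:, :K]                     # K lowest-frequency eigenvectors

rng = np.random.default_rng(0)
f = U_K @ rng.normal(size=K)       # a signal inside the assumed space

S = np.arange(0, n, 4)             # hypothetical query set: every 4th node
identifiable = np.linalg.matrix_rank(U_K[S]) == K
f_hat = recover_bandlimited(U_K, S, f[S])
```

When `identifiable` is true, `f_hat` matches `f` on all nodes, including the unlabeled ones in $\mathcal{S}^c$, up to numerical error.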

Figures (3)

  • Figure 1: Prediction performance on unlabeled nodes at different levels of labeling noise ($\sigma^2$). All three simulated networks have $n=100$ nodes, with the number of labeled nodes fixed at $25$.
  • Figure 2: For (a) SBM, nodes are grouped by the assigned community; for (b) BA, nodes are grouped by degree. The integer $i$ on each node represents the $i^{th}$ node queried by the proposed algorithm in one replication.
  • Figure 3: Ablation study: (a) The condition number (log scale) of the design matrix of query nodes selected by the proposed method and by random sampling. The effectiveness of (b) representative sampling and (c) incorporating covariate information in Algorithm 1.
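
The quantity plotted in Figure 3(a) can be computed along the following lines for the random-sampling baseline. This is a minimal sketch under stated assumptions: the design matrix is taken to be the rows of a synthetic Gaussian covariate matrix `X` indexed by the query set, which may differ from the paper's exact definition; `design_condition_number` is a hypothetical helper; and the sizes `n`, `p`, `m` are illustrative (`n = 100` and `m = 25` are borrowed from the Figure 1 caption, `p = 5` is arbitrary). The proposed selection rule itself is not implemented here.

```python
import numpy as np

def design_condition_number(X, S):
    """2-norm condition number of the rows of X indexed by S."""
    return np.linalg.cond(X[S])

rng = np.random.default_rng(0)
n, p, m = 100, 5, 25                      # illustrative sizes (assumptions)
X = rng.normal(size=(n, p))               # synthetic node covariates

# Random-sampling baseline: m distinct nodes chosen uniformly at random.
S_random = rng.choice(n, size=m, replace=False)

kappa = design_condition_number(X, S_random)
log_kappa = np.log10(kappa)               # log scale, as in the figure
```

A smaller condition number means the least-squares fit on the queried nodes is less sensitive to label noise, which is why it serves as a diagnostic for the quality of a query set.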

Theorems & Definitions (5)

  • Definition 1
  • Theorem 3.1
  • Theorem 4.1
  • Theorem 4.2
  • Lemma A.1