LEGO-Learn: Label-Efficient Graph Open-Set Learning

Haoyan Xu, Kay Liu, Zhengtao Yao, Philip S. Yu, Mengyuan Li, Kaize Ding, Yue Zhao

TL;DR

This work tackles graph open-set learning under a label-budget constraint by introducing LEGO-Learn, a framework that filters out OOD nodes with a GNN, selects highly informative ID nodes via K-Medoids, and trains an ID classifier with a weighted C+1 loss to balance purity and informativeness. The method iteratively refines the ID classifier while pruning OOD samples and leveraging post-hoc OOD detection to tighten separation between known and unknown classes. Empirical results on four real-world datasets show robust gains in ID accuracy and OOD detection (AUROC/AUPR), and ablations confirm the importance of each component, including the filtering, clustering, and weighting strategies. LEGO-Learn thus provides a practical, scalable solution for label-efficient graph open-set learning with meaningful real-world impact in domains where labeling cost is high and unexpected data are common.

Abstract

How can we train graph-based models to recognize unseen classes while keeping labeling costs low? Graph open-set learning (GOL) and out-of-distribution (OOD) detection aim to address this challenge by training models that can accurately classify known, in-distribution (ID) classes while identifying and handling previously unseen classes during inference. This capability is critical for high-stakes, real-world applications where models frequently encounter unexpected data, including finance, security, and healthcare. However, current GOL methods assume access to many labeled ID samples, which is unrealistic for large-scale graphs due to high annotation costs. In this paper, we propose LEGO-Learn (Label-Efficient Graph Open-set Learning), a novel framework that tackles open-set node classification on graphs within a given label budget by selecting the most informative ID nodes. LEGO-Learn employs a GNN-based filter to identify and exclude potential OOD nodes and then selects highly informative ID nodes for labeling using the K-Medoids algorithm. To prevent the filter from discarding valuable ID examples, we introduce a classifier that differentiates between the C known ID classes and an additional class representing OOD nodes (hence, a C+1 classifier). This classifier uses a weighted cross-entropy loss to balance the removal of OOD nodes against the retention of informative ID nodes. Experimental results on four real-world datasets demonstrate that LEGO-Learn significantly outperforms leading methods, with up to a 6.62% improvement in ID classification accuracy and a 7.49% increase in AUROC for OOD detection.
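
As a rough illustration of the weighted C+1 loss mentioned above, the sketch below computes a class-weighted cross-entropy over C+1 logits with NumPy. The function name `weighted_cross_entropy`, the weight values, and the normalization by the sum of weights are assumptions made for this example; the abstract does not specify them.

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights):
    """Class-weighted cross-entropy for a (C+1)-way classifier (illustrative).

    logits:        (n, C+1) raw scores; the last column is the OOD class here
                   (an assumed convention).
    labels:        (n,) integer labels in [0, C]; C marks an OOD node.
    class_weights: (C+1,) non-negative weights, one per class (hypothetical values).
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    class_weights = np.asarray(class_weights, dtype=float)

    # Numerically stable log-softmax over the C+1 classes.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Negative log-likelihood of the true class for each node.
    per_node = -log_probs[np.arange(len(labels)), labels]

    # Weight each node by the weight of its true class, then normalize
    # by the total weight (an assumed normalization).
    weights = class_weights[labels]
    return float((weights * per_node).sum() / weights.sum())
```

Under this sketch, giving the OOD class a smaller weight than the ID classes makes errors on ID-labeled nodes count relatively more, which is one plausible way to discourage a filter from discarding ID nodes.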

Paper Structure

This paper contains 30 sections, 12 equations, 3 figures, 12 tables, 1 algorithm.

Figures (3)

  • Figure 1: An illustration of graph open-set classification in a citation network. We aim to classify papers into ML research fields such as robotics, CV, and NLP (ID classes). We want to find nodes from ID classes to train the ID classifier.
  • Figure 2: An overview of our framework LEGO-Learn. The first step is to use a GNN-based filter to identify and remove OOD nodes, while using a $C+1$ classifier with weighted cross-entropy loss to avoid mistakenly eliminating valuable ID nodes (§\ref{subsec:filter}). A K-Medoids-based node selection method (§\ref{subsec:selection}) is then applied to choose the most informative ID nodes, which are annotated and used for the next round of training the ID classifier (§\ref{subsec:classifier}). Finally, the filter is retrained with both ID and unknown nodes, and a post-hoc OOD detection method is applied to strengthen the ID classifier's ability to recognize unseen classes (§\ref{subsec:post-hoc}).
  • Figure 3: OOD scores histogram
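
The K-Medoids-based selection step named in Figure 2 can be sketched as follows. This is a generic alternating K-Medoids on Euclidean distances, not necessarily the variant used in the paper; the function name `select_nodes_to_label`, the farthest-first initialization, and the use of node embeddings as input are assumptions for illustration.

```python
import numpy as np

def select_nodes_to_label(embeddings, budget, n_iter=100):
    """Return indices of `budget` medoid nodes (illustrative K-Medoids sketch).

    embeddings: (n, d) array of node representations, e.g. for nodes that
                a filter predicted as ID (assumed input).
    budget:     number of nodes to pick for annotation.
    """
    X = np.asarray(embeddings, dtype=float)
    # Pairwise Euclidean distances between all candidate nodes.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Deterministic farthest-first initialization (an assumed choice).
    chosen = [0]
    while len(chosen) < budget:
        nearest = dist[:, chosen].min(axis=1)
        chosen.append(int(nearest.argmax()))
    medoids = np.array(chosen)

    for _ in range(n_iter):
        # Assign every node to its closest medoid.
        assign = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(budget):
            members = np.flatnonzero(assign == c)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.sort(medoids)
```

Because medoids are actual data points, the returned indices map directly to nodes that could be sent for annotation, which is what makes K-Medoids a natural fit for selection under a label budget.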