Novel Node Category Detection Under Subpopulation Shift

Hsing-Huan Chung, Shravan Chaudhari, Yoav Wald, Xing Han, Joydeep Ghosh

TL;DR

This work tackles novel node category detection under subpopulation shift in attributed graphs by formulating it as a PU-learning problem without ground-truth novel labels. It introduces RECO-SLIP, a framework that combines recall-constrained optimization with a sample-efficient, graph-aware selective link prediction objective to preserve the latent subgroup structure induced by edges. Empirical results on five benchmark datasets show that RECO-SLIP consistently outperforms standard PU methods, propensity-weighting approaches, and graph PU baselines, demonstrating robustness to distribution shifts. The approach offers a practical and scalable solution for safety-critical graph applications, with code available for reproducibility and further development.

Abstract

In real-world graph data, distribution shifts can manifest in various ways, such as the emergence of new categories and changes in the relative proportions of existing categories. It is often important to detect nodes of novel categories under such distribution shifts for safety or insight discovery purposes. We introduce a new approach, Recall-Constrained Optimization with Selective Link Prediction (RECO-SLIP), to detect nodes belonging to novel categories in attributed graphs under subpopulation shifts. By integrating a recall-constrained learning framework with a sample-efficient link prediction mechanism, RECO-SLIP addresses the dual challenges of resilience against subpopulation shifts and the effective exploitation of graph structure. Our extensive empirical evaluation across multiple graph datasets demonstrates the superior performance of RECO-SLIP over existing methods. The experimental code is available at https://github.com/hsinghuan/novel-node-category-detection.

Paper Structure

This paper contains 35 sections, 10 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: An illustration of novel node category detection under subpopulation shift using a product co-purchasing network. The target domain consists of products from the two categories that exist in the source domain (sports and kitchen) and a novel category (weapons). Meanwhile, the relative proportion of the two original categories changes from source to target. The goal is to detect the products belonging to the novel category in the target domain.
  • Figure 2: An illustration of RECO-SLIP. The upper-right module is the recall-constrained optimization component (Eq. \ref{eq:opt_prob}), where the classifier adjusts its scores to minimize the FPR on the source while reserving enough target nodes as novel. The bottom module is the selective link prediction component, where the link prediction loss (Eq. \ref{eq:lp}) is imposed on the target subgraph excluding the nodes with the highest scores (Eqs. \ref{eq:z_alpha}, \ref{eq:e-}, \ref{eq:e+}). A solid orange arrow pair between two nodes denotes that their representation similarity is maximized, whereas a bidirectional hollow arrow denotes that the similarity is minimized.
  • Figure 3: Overall results of the shift intensity study. The x-axis represents shift intensity (NS: no shift, MS: minor shift, S: shift) and the y-axis represents AU-ROC performance.
  • Figure 4: The source and target distributions of Cora-S, Cora-MS, and Cora-NS. The last category (Category 7) is the novel category, so it does not appear in the source domain. The subpopulation shift in Cora-S is significant: more than half of the source nodes belong to Category 4, whereas Category 4 accounts for less than 10% of the target nodes. In contrast, the relative proportions of the non-novel categories are the same in the source and target of Cora-NS.
  • Figure 5: The source and target distributions of CiteSeer-S, CiteSeer-MS, and CiteSeer-NS.
  • ...and 3 more figures
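
The two components described in the Figure 2 caption can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the function names, the hard 0.5 score threshold, the penalty form of the constraint, and the parameters `alpha`, `min_target_fraction`, and `penalty` are all assumptions made for illustration, and the paper's actual formulation is given by its own equations. The first function evaluates a penalized objective that trades off the false positive rate on the source against a shortfall in the fraction of target nodes flagged as novel. The second keeps only the target edges whose endpoints both fall outside the highest-scored fraction of nodes, which is the subgraph on which a link prediction loss would then be applied.

```python
def flagged_fraction(scores, threshold=0.5):
    """Fraction of nodes whose novelty score exceeds the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


def penalized_objective(source_scores, target_scores,
                        min_target_fraction, penalty=10.0):
    """Source FPR plus a penalty for flagging too few target nodes.

    Hypothetical penalty form of a recall-style constraint; the name and
    the penalty weight are illustrative assumptions, not from the paper.
    """
    # Source nodes are all non-novel, so every flagged one is a false positive.
    fpr_source = flagged_fraction(source_scores)
    # The shortfall is positive only when too few target nodes are flagged.
    shortfall = max(0.0, min_target_fraction - flagged_fraction(target_scores))
    return fpr_source + penalty * shortfall


def select_link_prediction_edges(scores, edges, alpha):
    """Drop the top-alpha fraction of nodes by score, keep the remaining edges.

    `alpha` is an assumed fraction of nodes to exclude; edges are pairs of
    node indices into `scores`.
    """
    n_drop = round(alpha * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dropped = set(ranked[:n_drop])
    kept_nodes = [i for i in range(len(scores)) if i not in dropped]
    kept_edges = [(u, v) for u, v in edges
                  if u not in dropped and v not in dropped]
    return kept_nodes, kept_edges


# Toy data: four source nodes, four target nodes on a 4-cycle.
source_scores = [0.1, 0.2, 0.6, 0.3]
target_scores = [0.9, 0.8, 0.2, 0.1]
target_edges = [(0, 1), (1, 2), (2, 3), (0, 3)]

objective = penalized_objective(source_scores, target_scores,
                                min_target_fraction=0.5)
kept_nodes, kept_edges = select_link_prediction_edges(
    target_scores, target_edges, alpha=0.5)
```

In this toy example, one of the four source nodes is flagged and half of the target nodes are flagged, so the constraint is satisfied and only the source false positives contribute to the objective. Excluding the two highest-scored target nodes leaves a single edge between the low-scored nodes, so likely-novel nodes are not pulled toward their neighbours by the link prediction term.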