Coarse-to-Fine Open-Set Graph Node Classification with Large Language Models

Xueqi Ma, Xingjun Ma, Sarah Monazam Erfani, Danilo Mandic, James Bailey

TL;DR

The paper tackles open-set node classification on graphs with a coarse-to-fine framework (CFC) that uses large language models to identify semantic OOD samples and generate candidate OOD labels. A GNN-based fine classifier then classifies ID nodes and detects OOD nodes, aided by denoising and OOD data augmentation via manifold mixup. Final OOD classification is performed with LLM prompts that annotate OOD samples using a post-OOD label space, yielding notable gains in both OOD detection and multi-class OOD labeling across graph and text domains. The approach emphasizes semantic, interpretable OOD representations rather than synthetic or auxiliary OOD samples, targets open-world graph learning, and is backed by a theoretical analysis supporting subspace expansion and smoother decision boundaries.
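The manifold mixup step mentioned above can be illustrated with a minimal sketch. It assumes that new OOD embeddings are formed by interpolating the hidden-layer embeddings of two OOD nodes already identified by the coarse classifier, with a weight drawn from a Beta(alpha, alpha) distribution. The pairing strategy, the function names `manifold_mixup` and `augment_ood`, and the default `alpha` are illustrative assumptions, not details taken from the paper.

```python
import random


def manifold_mixup(h_a, h_b, alpha=1.0, rng=random):
    """Interpolate two hidden embeddings with a Beta(alpha, alpha) weight.

    h_a and h_b are equal-length lists of floats (hypothetical hidden-layer
    embeddings). Returns the mixed embedding and the sampled weight.
    """
    lam = rng.betavariate(alpha, alpha)  # lam lies in [0, 1]
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(h_a, h_b)]
    return mixed, lam


def augment_ood(ood_embeddings, num_new, alpha=1.0, seed=0):
    """Generate num_new mixed embeddings from pairs of OOD embeddings.

    Assumption: pairs are drawn uniformly at random from the identified OOD
    set, which must contain at least two embeddings.
    """
    rng = random.Random(seed)
    augmented = []
    for _ in range(num_new):
        h_a, h_b = rng.sample(ood_embeddings, 2)
        mixed, _ = manifold_mixup(h_a, h_b, alpha, rng)
        augmented.append(mixed)
    return augmented
```

Because each mixed point is a convex combination of two existing embeddings, it always lies on the segment between them, which is one intuition for how such augmentation could smooth the decision boundary around the OOD region.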

Abstract

Developing open-set classification methods capable of classifying in-distribution (ID) data while detecting out-of-distribution (OOD) samples is essential for deploying graph neural networks (GNNs) in open-world scenarios. Existing methods typically treat all OOD samples as a single class, even though real-world applications, especially high-stakes settings such as fraud detection and medical diagnosis, demand deeper insight into OOD samples, including their probable labels. This raises a critical question: can OOD detection be extended to OOD classification without true label information? To address this question, we propose a Coarse-to-Fine open-set Classification (CFC) framework that leverages large language models (LLMs) for graph datasets. CFC consists of three key components: a coarse classifier that uses LLM prompts for OOD detection and outlier label generation, a GNN-based fine classifier trained with OOD samples identified by the coarse classifier for enhanced OOD detection and ID classification, and refined OOD classification achieved through LLM prompts and post-processed OOD labels. Unlike methods that rely on synthetic or auxiliary OOD samples, CFC employs semantic OOD instances that are genuinely out-of-distribution based on their inherent meaning, improving interpretability and practical utility. Experimental results show that CFC improves OOD detection by ten percent over state-of-the-art methods on graph and text domains and achieves up to seventy percent accuracy in OOD classification on graph datasets.
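To make the coarse classifier's role concrete, the sketch below builds a rejection-style prompt from a node's text and the ID label space, then maps a raw LLM answer to either an ID label or an OOD flag with a candidate label. The prompt wording, the function names `build_easy_reject_prompt` and `parse_coarse_answer`, and the exact-match parsing rule are all hypothetical. The paper's actual prompts are not reproduced here, and no LLM is called.

```python
def build_easy_reject_prompt(text, id_labels):
    """Build a Q/A-style prompt from node text and the ID label space.

    The wording is an illustrative assumption, not the paper's prompt.
    """
    labels = ", ".join(id_labels)
    return (
        "Q: Which category from the list best describes the text? "
        "If none fits, answer with a more suitable category name.\n"
        f"Text: {text}\n"
        f"Categories: [{labels}]\n"
        "A:"
    )


def parse_coarse_answer(answer, id_labels):
    """Map a raw LLM answer to ('ID', label) or ('OOD', candidate_label).

    Assumption: an answer counts as ID only if it matches an ID label
    exactly (case-insensitive); anything else is kept as a candidate
    OOD label.
    """
    cleaned = answer.strip()
    for label in id_labels:
        if cleaned.lower() == label.lower():
            return ("ID", label)
    return ("OOD", cleaned)
```

In a setup like this, the nodes flagged as OOD would feed the fine classifier's training set, and their candidate labels would be post-processed into the label space used for the final OOD classification step.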

Paper Structure

This paper contains 53 sections, 8 equations, 6 figures, 16 tables.

Figures (6)

  • Figure 1: Comparison of the subspaces of methods without semantic OOD information and of our proposed CFC, which incorporates such information. Blue regions denote ID subspaces; other regions show OOD subspaces. CFC provides a larger embedding space (pink), enabling direct OOD identification.
  • Figure 2: LLM prompts for Easy-Reject and Hard-Reject OOD detection include both Q(uestion) and A(nswer) contents. The inputs are [text] (describing the graph node) and [ID label space] (a list of ID categories, e.g., [machine learning, neural networks, ...]). For Hard-Reject OOD detection, we first determine the [Major Category] of ID classes and the [candidate OOD label space], then use [text], [ID label space], and [candidate OOD label space] for OOD detection and category generation.
  • Figure 3: LLM prompts with [text] and [post-OOD label space] for OOD classification.
  • Figure 4: Ablation study on (a) LLM prompts for OOD identification, (b) various LLMs for OOD identification, and (c) LLM prompts for OOD classification on Cora and Citeseer.
  • Figure 5: Effect on CFC performance of (a) the number of OOD samples identified by coarse-grained classification, and (b) the number of OOD samples generated by manifold mixup, on Cora and Citeseer.
  • ...and 1 more figure

Theorems & Definitions (1)

  • proof