Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs

Yushi Feng, Tsai Hor Chan, Guosheng Yin, Lequan Yu

TL;DR

The paper tackles data scarcity and noise in graph representation learning and the barrier posed by white-box LLM augmenters. It introduces DemoGraph, a black-box, context-driven graph data augmenter that constructs latent knowledge graphs from prompts and dynamically merges them into the training graph, with granularity-aware prompting and instruction fine-tuning to manage sparsity. Empirical results across generic benchmarks, large-scale graphs, and electronic health records demonstrate superior performance and improved interpretability, validating the method's robustness and scalability. The approach broadens the use of open-world domain knowledge in graph learning and offers a practical, democratized framework for LLM-assisted data augmentation with broad potential applications beyond healthcare.

Abstract

Data augmentation is necessary for graph representation learning because graph data are often scarce and noisy. Most existing augmentation methods overlook the context information inherited from the dataset, as they rely solely on the graph structure for augmentation. Despite the success of some large language model (LLM)-based graph learning methods, they are mostly white-box: they require access to the weights or latent features of open-access LLMs, which makes them difficult to democratize because existing LLMs are mostly closed-source for commercial reasons. To overcome these limitations, we propose DemoGraph, a black-box, context-driven graph data augmentation approach guided by LLMs. Leveraging the text prompt as context-related information, we task the LLM with generating knowledge graphs (KGs), which allow us to capture the structural interactions from the text outputs. We then design a dynamic merging schema to stochastically integrate the LLM-generated KGs into the original graph during training. To control the sparsity of the augmented graph, we further devise a granularity-aware prompting strategy and an instruction fine-tuning module, which seamlessly generate text prompts according to the different granularity levels of the dataset. Extensive experiments on various graph learning tasks validate the effectiveness of our method over existing graph data augmentation methods. Notably, our approach excels in scenarios involving electronic health records (EHRs), which validates its maximal utilization of contextual knowledge and leads to enhanced predictive performance and interpretability.
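To make the idea of stochastically merging LLM-generated KGs into the original graph more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `merge_kg`, the `keep_prob` parameter, the independent per-edge sampling rule, and the toy EHR-style node names are all hypothetical, and the abstract does not specify how the sampling is actually done.

```python
import random


def merge_kg(base_edges, kg_triples, keep_prob, rng):
    """Return the original edges plus a random subset of KG edges.

    Assumption: each KG triple (head, relation, tail) is kept independently
    with probability keep_prob. The relation label is dropped here for
    simplicity; a real pipeline might keep it as an edge type.
    """
    merged = set(base_edges)  # original graph edges are always retained
    for head, _relation, tail in kg_triples:
        if rng.random() < keep_prob:
            merged.add((head, tail))
    return merged


# Hypothetical toy graph: one visit node linked to two medical concepts.
base_edges = {("visit_1", "aspirin"), ("visit_1", "hypertension")}

# Hypothetical triples standing in for KG text parsed from an LLM response.
kg_triples = [
    ("aspirin", "treats", "pain"),
    ("hypertension", "risk_factor_for", "stroke"),
    ("aspirin", "interacts_with", "warfarin"),
]

rng = random.Random(0)  # seeded so runs are repeatable
for epoch in range(3):
    # Resample the augmented graph at every epoch, so the model sees a
    # different mix of injected knowledge each time.
    aug_edges = merge_kg(base_edges, kg_triples, keep_prob=0.5, rng=rng)
    print(epoch, len(aug_edges))
```

In this sketch, `keep_prob` plays the role of a simple sparsity knob: lower values inject fewer KG edges per epoch. The paper itself controls sparsity through granularity-aware prompting and instruction fine-tuning, which this example does not attempt to reproduce.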

Paper Structure

This paper contains 27 sections, 2 equations, 10 figures, 15 tables, 1 algorithm.

Figures (10)

  • Figure 1: Schematic illustration of the feature distribution of original graph $\mathcal{G}_0$ from observations and $\mathcal{G}^{\text{aug}}$, which represents the augmented graph for $\mathcal{G}_0$ after merging the context knowledge in terms of $\mathcal{KG}$. After performing graph data augmentation with LLM-guided DemoGraph, $\mathcal{G}^{\text{aug}}$ is closer to the true representation $\mathcal{G}_t$.
  • Figure 2: Overview of our proposed DemoGraph framework. Given a dataset, we first construct a graph $\mathcal{G}_0$ to highlight the relational information, and then perform context-driven knowledge retrieval by utilizing the original dataset and a frozen generative pre-trained LLM. We conduct contextual, adaptive, sparsity-controllable and granularity-aware prompt learning on the LLM, thus obtaining either concept-specific KGs or important extra concept nodes at different levels after refinement. For the original graph $\mathcal{G}_0$, we perform graph data augmentation with the domain-knowledge injection procedure. We train a GNN model on the augmented graph $\mathcal{G}^\text{aug}$, so our framework is able to handle a wide range of downstream tasks across various domains, depending on the original datasets.
  • Figure 3: Concept pruning via instruction fine-tuning, where trivial concepts can be pruned by re-prompting the coarse set of concepts to the LLM.
  • Figure 4: Visualization of the learned node embeddings w/ (left) and w/o (right) our graph data augmentation, respectively. We use MIMIC-III as the example and colour nodes differently by their entity types.
  • Figure 5: Visualization of the interpretability of DemoGraph: a visit node (blue) and related concept nodes (red), with attention scores visualized by the size and shade of the red nodes.
  • ...and 5 more figures