LPNL: Scalable Link Prediction with Large Language Models

Baolong Bi; Shenghua Liu; Yiwei Wang; Lingrui Mei; Xueqi Cheng

TL;DR

This work tackles scalable link prediction on large-scale heterogeneous graphs under token-length constraints by introducing LPNL, an LLM-based framework that converts graph context into natural-language prompts. It combines a two-stage sampling pipeline (normalized degree sampling and Personalized PageRank) with a divide-and-conquer prediction strategy and self-supervised fine-tuning of a T5-base backbone, guided by a token-length limit $L$. Empirically, LPNL outperforms advanced GNN baselines on four OAG subgraphs across NDCG, MRR, and Hits@1, and demonstrates strong few-shot learning and cross-domain transfer capabilities. The results indicate that carefully designed prompts and informative anchor-based context enable scalable, robust reasoning over large heterogeneous graphs using LLMs.

Abstract

Applying large language models (LLMs) to graph learning is an emerging endeavor. However, the vast amount of information inherent in large graphs poses significant challenges to this process. This work focuses on the link prediction task and introduces $\textbf{LPNL}$ (Link Prediction via Natural Language), an LLM-based framework designed for scalable link prediction on large-scale heterogeneous graphs. We design novel prompts for link prediction that articulate graph details in natural language. We propose a two-stage sampling pipeline to extract crucial information from the graphs, and a divide-and-conquer strategy to keep the input tokens within predefined limits, addressing the challenge of overwhelming information. We fine-tune a T5 model with a self-supervised learning objective designed for link prediction. Extensive experimental results demonstrate that LPNL outperforms multiple advanced baselines in link prediction tasks on large-scale graphs.
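The two-stage sampling idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`build_adjacency`, `degree_sample`, `personalized_pagerank`, `select_anchors`), the `budget`, `k`, and `alpha` parameters, and the exact form of the degree-based sampling are assumptions made for illustration. Stage 1 grows a small subgraph around a node by degree-weighted sampling, and stage 2 ranks the sampled nodes by Personalized PageRank to pick anchor nodes.

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def degree_sample(adj, start, budget, rng):
    """Stage 1 (assumed form): grow a subgraph of at most `budget` nodes from
    `start`, picking frontier nodes with probability proportional to degree."""
    sampled = {start}
    frontier = set(adj[start])
    while frontier and len(sampled) < budget:
        nodes = sorted(frontier)  # fixed order keeps seeded runs reproducible
        weights = [len(adj[n]) for n in nodes]  # rng.choices normalizes these
        pick = rng.choices(nodes, weights=weights, k=1)[0]
        sampled.add(pick)
        frontier.discard(pick)
        frontier.update(n for n in adj[pick] if n not in sampled)
    return sampled


def personalized_pagerank(adj, nodes, source, alpha=0.15, iters=50):
    """Stage 2: power iteration restricted to `nodes`, restarting at `source`
    with probability `alpha` (an illustrative default)."""
    score = {n: 0.0 for n in nodes}
    score[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            nbrs = [m for m in adj[n] if m in nodes]
            if not nbrs:
                # A node with no sampled neighbors sends its mass to the source.
                nxt[source] += (1 - alpha) * score[n]
                continue
            share = (1 - alpha) * score[n] / len(nbrs)
            for m in nbrs:
                nxt[m] += share
        nxt[source] += alpha * sum(score.values())
        score = nxt
    return score


def select_anchors(adj, source, budget, k, seed=0):
    """Return up to `k` anchor nodes for `source`, ranked by PPR score."""
    rng = random.Random(seed)
    nodes = degree_sample(adj, source, budget, rng)
    scores = personalized_pagerank(adj, nodes, source)
    ranked = sorted((n for n in nodes if n != source),
                    key=lambda n: (-scores[n], n))
    return ranked[:k]


if __name__ == "__main__":
    # Toy heterogeneous graph; the node IDs are made up for this example.
    toy_edges = [("author:a1", "paper:p1"), ("author:a1", "paper:p2"),
                 ("author:a2", "paper:p1"), ("author:a3", "paper:p2"),
                 ("paper:p1", "venue:v1"), ("paper:p2", "venue:v1")]
    print(select_anchors(build_adjacency(toy_edges), "author:a1", budget=5, k=2))
```

In this sketch the same routine would be run for the source node and for each candidate neighbor, so that every node described in a prompt comes with a short, ranked list of anchors.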

Paper Structure (16 sections, 4 equations, 6 figures, 5 tables)

Introduction
The LPNL Architecture
Preliminary
Prompt Design for Link Prediction
Two-Stage Sampling
Divide-and-Conquer Prediction
Self-Supervised Fine-tuning
Experiments
Experiment Settings
Overall Performance
Cross-Domain Knowledge Transfer
Few-Shot Learning
Ablation Study
Related Work
Discussion
...and 1 more sections

Figures (6)

Figure 1: An example of a heterogeneous graph.
Figure 2: The framework of LPNL. For an input heterogeneous graph with link prediction tasks, LPNL consists of three steps: (1) Conduct two-stage sampling on the source node and on each candidate neighbor from the original candidate set to acquire anchor nodes. (2) Generate prompts based on these anchor nodes and input them into LLMs for predictions. (3) Refine the candidate set based on the prediction results and iteratively apply this divide-and-conquer process to obtain the distinct link prediction result $c^*$.
Figure 3: The prompt example consists of three components: prefix_question: a selective question; source_node_description: the description of the source node and its corresponding anchor nodes; candidate_nodes_description: the description of candidate neighbors and the anchor nodes corresponding to each candidate neighbor.
Figure 4: For a link prediction task involving 100 candidate neighbors, we set the candidate length limit $L$ to 5. The candidate neighbors can be divided into 20 sets, followed by three rounds of divide-and-conquer. This process ultimately yields a unique prediction result.
Figure 5: Cross-domain transfer results.
...and 1 more figures
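
The divide-and-conquer arithmetic in the Figure 4 caption (100 candidates, limit $L = 5$, three rounds) can be checked with a short Python sketch. This is a sketch under stated assumptions rather than the paper's code: `divide_and_conquer` and `pick_best` are hypothetical names, and `pick_best` stands in for the LLM call that selects one candidate from each group.

```python
def divide_and_conquer(candidates, limit, pick_best):
    """Repeatedly split the pool into groups of at most `limit` candidates,
    keep one winner per group, and stop when a single candidate remains.

    Returns the winner and the number of groups formed in each round.
    """
    if limit < 2:
        raise ValueError("limit must be at least 2 to shrink the pool")
    pool = list(candidates)
    if not pool:
        raise ValueError("candidates must not be empty")
    groups_per_round = []
    while len(pool) > 1:
        groups = [pool[i:i + limit] for i in range(0, len(pool), limit)]
        groups_per_round.append(len(groups))
        # In LPNL this step would be an LLM prediction over one prompt per
        # group; here `pick_best` is any function that returns one item.
        pool = [pick_best(group) for group in groups]
    return pool[0], groups_per_round


if __name__ == "__main__":
    # Toy stand-in: treat the candidate ID itself as its score and use max.
    winner, trace = divide_and_conquer(range(100), limit=5, pick_best=max)
    print(winner, trace)  # 99 [20, 4, 1]
```

With 100 candidates and a limit of 5, the first round forms 20 groups, the second forms 4, and the third forms 1, which matches the three rounds described in the caption. Each round needs one prompt per group, so the per-prompt input stays bounded no matter how large the original candidate set is.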
