LPNL: Scalable Link Prediction with Large Language Models
Baolong Bi, Shenghua Liu, Yiwei Wang, Lingrui Mei, Xueqi Cheng
TL;DR
This work tackles scalable link prediction on large-scale heterogeneous graphs under token-length constraints by introducing LPNL, an LLM-based framework that converts graph context into natural-language prompts. It combines a two-stage sampling pipeline (normalized degree sampling and Personalized PageRank) with a divide-and-conquer prediction strategy and self-supervised fine-tuning of a T5-base backbone, guided by a token-length limit $L$. Empirically, LPNL outperforms advanced GNN baselines on four OAG subgraphs across NDCG, MRR, and Hits@1, and demonstrates strong few-shot learning and cross-domain transfer capabilities. The results indicate that carefully designed prompts and informative anchor-based context enable scalable, robust reasoning over large heterogeneous graphs using LLMs.
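As a rough illustration of the Personalized PageRank stage mentioned above, the sketch below scores nodes by power iteration from a source node and keeps the top-scoring ones as context. It is only a sketch under stated assumptions: the function names, the restart probability `alpha`, the iteration count, and the adjacency-dict graph format are all hypothetical, and the summary does not say how LPNL combines these scores with normalized degree sampling.

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=50):
    """Approximate PPR scores by power iteration.

    adj:    dict mapping each node to a list of neighbour nodes
            (every neighbour must itself be a key of adj).
    source: node the random walk restarts from.
    alpha:  restart probability (an assumed value, not from the paper).
    """
    nodes = list(adj)
    scores = {n: 0.0 for n in nodes}
    scores[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            nbrs = adj[n]
            if not nbrs:
                # Dangling node: send its walk mass back to the source.
                nxt[source] += (1 - alpha) * scores[n]
                continue
            share = (1 - alpha) * scores[n] / len(nbrs)
            for m in nbrs:
                nxt[m] += share
        # Restart step: a fraction alpha of the total mass returns to source.
        nxt[source] += alpha
        scores = nxt
    return scores


def top_k_context(adj, source, k):
    """Return the k highest-scoring nodes other than the source."""
    scores = personalized_pagerank(adj, source)
    ranked = sorted((n for n in adj if n != source),
                    key=lambda n: scores[n], reverse=True)
    return ranked[:k]


# Toy path graph s - a - b - c (hypothetical data).
toy_graph = {"s": ["a"], "a": ["s", "b"], "b": ["a", "c"], "c": ["b"]}
context_nodes = top_k_context(toy_graph, "s", k=2)
```

On the toy path graph, nodes closer to the source receive higher scores, so the selected context favours the source's near neighbourhood.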
Abstract
Exploring the application of large language models (LLMs) to graph learning is an emerging endeavor. However, the vast amount of information inherent in large graphs poses significant challenges to this process. This work focuses on the link prediction task and introduces $\textbf{LPNL}$ (Link Prediction via Natural Language), a framework based on large language models designed for scalable link prediction on large-scale heterogeneous graphs. We design novel prompts for link prediction that articulate graph details in natural language. We propose a two-stage sampling pipeline to extract crucial information from the graphs, and a divide-and-conquer strategy that keeps the input within a predefined token limit, addressing the challenge of overwhelming information. We fine-tune a T5 model with a self-supervised learning objective designed for link prediction. Extensive experimental results demonstrate that LPNL outperforms multiple advanced baselines in link prediction tasks on large-scale graphs.
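One way to picture the divide-and-conquer strategy is as a tournament over candidate nodes: candidates are packed into groups that fit the token limit, a winner is chosen from each group, and the winners are regrouped until one remains. The sketch below is a minimal illustration under assumptions, not the authors' implementation: `pick_best` is a hypothetical stand-in for a call to the fine-tuned model, whitespace splitting stands in for a real tokenizer, and the budget covers only the candidate descriptions rather than a full prompt.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Assumption: whitespace tokens approximate the model's tokenizer.
    return len(text.split())


def chunk_by_budget(candidates: List[str], budget: int) -> List[List[str]]:
    """Greedily pack candidate descriptions into groups within the budget."""
    groups, current, used = [], [], 0
    for cand in candidates:
        cost = count_tokens(cand)
        if current and used + cost > budget:
            groups.append(current)
            current, used = [], 0
        current.append(cand)
        used += cost
    if current:
        groups.append(current)
    return groups


def divide_and_conquer(candidates: List[str],
                       pick_best: Callable[[List[str]], str],
                       budget: int) -> str:
    """Reduce the candidate pool group by group until one candidate is left."""
    if not candidates:
        raise ValueError("no candidates to rank")
    pool = list(candidates)
    while len(pool) > 1:
        groups = chunk_by_budget(pool, budget)
        if len(groups) == len(pool):
            # Every group holds one candidate, so no comparison can happen.
            raise ValueError("budget too small to compare two candidates")
        pool = [pick_best(group) for group in groups]
    return pool[0]


# Hypothetical usage: ten two-token candidates, at most six tokens per group.
candidates = [f"node {i}" for i in range(10)]
toy_pick = lambda group: max(group, key=lambda s: int(s.split()[1]))
winner = divide_and_conquer(candidates, toy_pick, budget=6)
```

Because each round calls the model once per group, the number of calls grows with the candidate count while every individual input stays under the limit.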
