LPNL: Scalable Link Prediction with Large Language Models

Baolong Bi, Shenghua Liu, Yiwei Wang, Lingrui Mei, Xueqi Cheng

TL;DR

This work tackles scalable link prediction on large-scale heterogeneous graphs under token-length constraints by introducing LPNL, an LLM-based framework that converts graph context into natural-language prompts. It combines a two-stage sampling pipeline (normalized degree sampling and Personalized PageRank) with a divide-and-conquer prediction strategy and self-supervised fine-tuning of a T5-base backbone, guided by a token-length limit $L$. Empirically, LPNL outperforms advanced GNN baselines on four OAG subgraphs across NDCG, MRR, and Hits@1, and demonstrates strong few-shot learning and cross-domain transfer capabilities. The results indicate that carefully designed prompts and informative anchor-based context enable scalable, robust reasoning over large heterogeneous graphs using LLMs.

Abstract

Exploring the application of large language models (LLMs) to graph learning is an emerging endeavor. However, the vast amount of information inherent in large graphs poses significant challenges to this process. This work focuses on the link prediction task and introduces $\textbf{LPNL}$ (Link Prediction via Natural Language), a framework based on large language models designed for scalable link prediction on large-scale heterogeneous graphs. We design novel prompts for link prediction that articulate graph details in natural language. We propose a two-stage sampling pipeline to extract crucial information from the graphs, and a divide-and-conquer strategy to keep the input tokens within predefined limits, addressing the challenge of overwhelming information. We fine-tune a T5 model using self-supervised learning designed for link prediction. Extensive experimental results demonstrate that LPNL outperforms multiple advanced baselines in link prediction tasks on large-scale graphs.

Paper Structure (16 sections, 4 equations, 6 figures, 5 tables)


Figures (6)

  • Figure 1: An example of a heterogeneous graph
  • Figure 2: The framework of LPNL. For an input heterogeneous graph with link prediction tasks, LPNL consists of three steps: (1) Conduct a two-stage sampling on the source node and each candidate neighbor from the original candidate set to acquire anchor nodes. (2) Generate prompts based on these anchor nodes and input them into LLMs for predictions. (3) Refine the candidate set based on prediction results and iteratively apply this divide-and-conquer process to obtain the distinct link prediction result $c^*$.
  • Figure 3: The prompt example consists of three components: prefix_question: a selective question; source_node_description: the description of the source node and its corresponding anchor nodes; candidate_nodes_description: the description of candidate neighbors and the anchor nodes corresponding to each candidate neighbor.
  • Figure 4: For a link prediction task involving 100 candidate neighbors, we set the candidate length limit $L$ to 5. The candidate neighbors can be divided into 20 sets, followed by three rounds of divide-and-conquer. This process ultimately yields a unique prediction result.
  • Figure 5: Cross-domain transfer results.
  • ...and 1 more figures
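
The round count in the Figure 4 caption (100 candidates, limit $L = 5$, three rounds) can be checked with a minimal sketch of a divide-and-conquer reduction. This is an illustration only, not the authors' implementation: the function name `divide_and_conquer` and the `pick_best` callback are hypothetical, and `pick_best` stands in for the LLM call that selects one neighbor from a group of at most $L$ candidates.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def divide_and_conquer(
    candidates: Sequence[T],
    pick_best: Callable[[Sequence[T]], T],
    limit: int,
) -> Tuple[T, List[int]]:
    """Reduce candidates to one winner, showing at most `limit` per call.

    `pick_best` is a hypothetical stand-in for the model prediction on one
    prompt; here it is any function that returns one element of its input.
    Returns the winner and the number of groups formed in each round.
    """
    if limit < 2:
        raise ValueError("limit must be at least 2 to make progress")
    if not candidates:
        raise ValueError("candidates must be non-empty")

    pool = list(candidates)
    groups_per_round: List[int] = []
    while len(pool) > 1:
        # Split the surviving candidates into groups of at most `limit`.
        groups = [pool[i:i + limit] for i in range(0, len(pool), limit)]
        groups_per_round.append(len(groups))
        # Keep one winner per group; the winners form the next round's pool.
        pool = [pick_best(group) for group in groups]
    return pool[0], groups_per_round


if __name__ == "__main__":
    # Toy stand-in: candidates are integers and the "model" picks the largest.
    winner, groups_per_round = divide_and_conquer(list(range(100)), max, 5)
    print(winner, groups_per_round)
```

With 100 candidates and a limit of 5, the pool shrinks from 100 to 20 to 4 to 1, so the sketch forms 20, 4, and 1 groups across three rounds, matching the caption.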