GraphiT: Efficient Node Classification on Text-Attributed Graphs with Prompt Optimized LLMs
Shima Khoshraftar, Niaz Abedini, Amir Hajian
TL;DR
GraphiT tackles node classification on text-attributed graphs by converting neighborhood information into concise textual prompts and automatically optimizing those prompts with the DSPy framework. It introduces neighbor keyphrases as an efficient graph encoding that reduces context length while preserving predictive power, and it uses COPRO-based prompt optimization to tailor instructions and demonstrations without extensive manual tuning. Empirical results on Cora, PubMed, and Ogbn-arxiv show that GraphiT consistently surpasses vanilla LLM baselines and exceeds several prior LLM-based methods; on PubMed, its results are competitive with GCN. Ablations confirm the effectiveness and token efficiency of neighbor keyphrases. The approach offers a reproducible, scalable pathway for leveraging LLMs in graph prediction tasks, with practical gains in both accuracy and cost through principled encoding and automated prompt design.
Abstract
The application of large language models (LLMs) to graph data has attracted considerable attention recently. LLMs allow us to use deep contextual embeddings from pretrained models in text-attributed graphs, where shallow embeddings are often used for the text attributes of nodes. However, it is still challenging to efficiently encode the graph structure and features into a sequential form that LLMs can consume. In addition, the performance of an LLM on its own is highly dependent on the structure of the input prompt. This limits its reliability and often requires iterative manual adjustments that can be slow, tedious, and difficult to replicate programmatically. In this paper, we propose GraphiT (Graphs in Text), a framework for encoding graphs into a textual format and optimizing LLM prompts for graph prediction tasks. Here we focus on node classification for text-attributed graphs. We encode the graph data for every node and its neighborhood into a concise text so that LLMs can better utilize the information in the graph. We then programmatically optimize the LLM prompts using the DSPy framework, which automates this step and makes it more efficient and reproducible. GraphiT outperforms our LLM-based baselines on three datasets, and we show that the optimization step in GraphiT leads to measurably better results without manual prompt tweaking. We also demonstrate that our graph encoding approach is competitive with other graph encoding methods while being less expensive, because it uses significantly fewer tokens for the same task.
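To make the encoding idea concrete, the following is a minimal sketch of how a node and its neighborhood might be turned into a compact prompt. It is an illustration only, not the paper's implementation: the function names (extract_keyphrases, encode_node), the tiny stopword list, the frequency-based keyphrase heuristic, the prompt wording, and the toy graph are all assumptions introduced here. The DSPy optimization step is not shown.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption for illustration).
STOPWORDS = {"a", "an", "the", "of", "for", "and", "in", "on", "with", "to", "we", "is", "are"}


def extract_keyphrases(text, k=3):
    """Return up to k frequent non-stopword tokens.

    This is a stand-in for a real keyphrase extractor, not the paper's method.
    """
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS and len(t) > 2]
    return [word for word, _ in Counter(tokens).most_common(k)]


def encode_node(node_id, texts, edges, labels, k=3):
    """Build a prompt from the node's own text plus its neighbors' keyphrases."""
    # Treat edges as undirected when collecting neighbors.
    neighbors = sorted({v for u, v in edges if u == node_id}
                       | {u for u, v in edges if v == node_id})
    lines = [f"Target paper: {texts[node_id]}"]
    for n in neighbors:
        phrases = ", ".join(extract_keyphrases(texts[n], k))
        lines.append(f"Neighbor keyphrases: {phrases}")
    lines.append(f"Classify the target paper into one of: {', '.join(labels)}.")
    return "\n".join(lines)


# Toy data, invented for this example.
texts = {
    0: "Graph neural networks for citation data",
    1: "Reinforcement learning policy gradient reinforcement learning",
    2: "Bayesian inference with probabilistic graphical models",
}
edges = [(0, 1), (2, 0)]
labels = ["Neural Networks", "Reinforcement Learning", "Probabilistic Methods"]

prompt = encode_node(0, texts, edges, labels, k=2)
print(prompt)
```

Each neighbor contributes only a few keyphrases instead of its full text, so the prompt length grows slowly with the number of neighbors. This is the token-saving intuition behind the encoding described above.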
