Self-Exploring Language Models for Explainable Link Forecasting on Temporal Graphs via Reinforcement Learning
Zifeng Ding, Shenyang Huang, Zeyu Cao, Emma Kondrup, Zachary Yang, Xingyue Huang, Yuan Sui, Zhangdie Yuan, Yuqicheng Zhu, Xianglong Hu, Yuan He, Farimah Poursafaei, Michael Bronstein, Andreas Vlachos
TL;DR
This work tackles explainable future-link forecasting on temporal graphs (TGs) by fine-tuning LLMs with reinforcement learning. It introduces ReaL-TG, which combines Temporal Context Graph Selection, a GRPO-based RL objective with an outcome-based reward, and a QA-style prompt to generate both predictions and reasoning traces. It also proposes an evaluation protocol that pairs MRR/pMRR for predictions with an LLM-as-a-Judge that assesses the faithfulness, consistency, and alignment of the reasoning. Results show that ReaL-TG-4B often outperforms much larger frontier LLMs on both seen and unseen graphs while producing high-quality explanations. The framework thus offers an interpretable approach to TG forecasting that generalizes to new graphs without retraining and comes with a dedicated mechanism for evaluating reasoning.
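To make the "GRPO-based RL objective with an outcome-based reward" more concrete, the following is a minimal sketch of the group-relative advantage that GRPO-style methods typically compute. It is not the authors' implementation: the function name, the example reward values, the use of the population standard deviation, and the `eps` constant are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one group of sampled responses.

    `rewards` holds one scalar outcome reward per response sampled for the
    same prompt (hypothetical values; the paper's reward is not reproduced).
    Each advantage is the reward's deviation from the group mean, scaled by
    the group standard deviation (population std here, by assumption).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one forecasting query, scored only on the outcome.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

Responses that score above the group mean receive positive advantages and are reinforced; when all responses in a group receive the same reward, every advantage is zero and the group contributes no learning signal.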
Abstract
Forecasting future links is a central task in temporal graph (TG) reasoning, requiring models to leverage historical interactions to predict upcoming ones. Traditional neural approaches, such as temporal graph neural networks, achieve strong performance but lack explainability and cannot be applied to unseen graphs without retraining. Recent studies have begun to explore large language models (LLMs) for graph reasoning, but most are limited to static graphs or small synthetic TGs and do not evaluate the quality of the reasoning traces that LLMs generate. In this work, we present Reasoning-Enhanced Learning for Temporal Graphs (ReaL-TG), a reinforcement learning framework that fine-tunes LLMs to perform explainable link forecasting on real-world TGs. ReaL-TG uses an outcome-based reward to encourage models to self-explore reasoning strategies from graph structure and to produce explanations that directly justify their predictions. To enable evaluation of LLM-generated reasoning traces, we propose a new evaluation protocol that combines ranking metrics with an LLM-as-a-Judge system assessing both the quality of reasoning and the impact of hallucinations. Experiments with ReaL-TG-4B, obtained by fine-tuning Qwen3-4B under our framework, show that it outperforms much larger frontier LLMs, including GPT-5 mini, on ranking metrics, while producing high-quality explanations, as confirmed by both the LLM judge and human evaluation.
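As an illustration of the ranking metrics mentioned above, the sketch below computes the standard mean reciprocal rank (MRR) over a set of forecasting queries. It assumes each query yields an ordered list of predicted destination nodes and has a single ground-truth node, and that a missing ground truth scores 0; these conventions and all names are hypothetical, and the penalized variant (pMRR) is not reproduced here.

```python
def reciprocal_rank(predictions, truth):
    """Return 1/rank of `truth` in the ordered `predictions`, or 0.0 if absent."""
    for position, node in enumerate(predictions, start=1):
        if node == truth:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(all_predictions, truths):
    """Average the reciprocal rank over all queries."""
    if len(all_predictions) != len(truths) or not truths:
        raise ValueError("need one non-empty, aligned entry per query")
    scores = [reciprocal_rank(p, t) for p, t in zip(all_predictions, truths)]
    return sum(scores) / len(scores)


# Three toy queries: truth ranked second, ranked first, and not predicted.
mrr = mean_reciprocal_rank([["b", "a"], ["c"], ["x", "y"]], ["a", "c", "z"])
```

In this toy case the per-query scores are 0.5, 1.0, and 0.0, so the MRR is 0.5.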
