TimeXL: Explainable Multi-modal Time Series Prediction with LLM-in-the-Loop
Yushan Jiang, Wenchao Yu, Geon Lee, Dongjin Song, Kijung Shin, Wei Cheng, Yanchi Liu, Haifeng Chen
TL;DR
TimeXL tackles explainable multi-modal time-series forecasting by integrating a prototype-based encoder for time series and text with an LLM-in-the-loop consisting of prediction, reflection, and refinement agents. The encoder generates case-based explanations and predictions, which the prediction LLM refines using those explanations; a reflection LLM identifies textual weaknesses and a refinement LLM updates the context, triggering encoder retraining in a closed loop. The final prediction blends encoder-based probabilities and LLM-derived forecasts via $\boldsymbol{\hat{y}} = \alpha \boldsymbol{\hat{y}}_{\text{enc}} + (1-\alpha) \boldsymbol{\hat{y}}_{\text{LLM}}$, and the process yields improved performance and faithful, human-centric explanations. Experiments on four real-world datasets show TimeXL achieving up to 8.9% AUROC improvements and delivering explainable, multi-modal reasoning that enhances trust and decision-making in complex settings.
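The blending rule above is a convex combination of two class-score vectors, which a few lines of Python can illustrate. This is a minimal sketch, not the authors' implementation: the function name `fuse_predictions`, the one-hot encoding of the LLM's output, and the value of `alpha` are all assumptions made for illustration.

```python
import numpy as np


def fuse_predictions(y_enc, y_llm, alpha=0.5):
    """Return alpha * y_enc + (1 - alpha) * y_llm for two class-score vectors.

    y_enc: encoder class probabilities.
    y_llm: LLM prediction expressed as class scores (assumed one-hot here).
    alpha: blending weight in [0, 1]; a hypothetical value, not from the paper.
    """
    y_enc = np.asarray(y_enc, dtype=float)
    y_llm = np.asarray(y_llm, dtype=float)
    if y_enc.shape != y_llm.shape:
        raise ValueError("y_enc and y_llm must have the same shape")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * y_enc + (1.0 - alpha) * y_llm


# Toy two-class example: the encoder leans towards class 0,
# while the LLM (one-hot, by assumption) picks class 1.
y_enc = np.array([0.7, 0.3])
y_llm = np.array([0.0, 1.0])
fused = fuse_predictions(y_enc, y_llm, alpha=0.6)
predicted_class = int(fused.argmax())
```

With `alpha = 0.6` the fused scores are 0.42 and 0.58, so the LLM's vote tips the decision to class 1. Setting `alpha = 1` recovers the encoder alone and `alpha = 0` the LLM alone.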
Abstract
Time series analysis provides essential insights into real-world system dynamics and informs downstream decision-making, yet most existing methods overlook the rich contextual signals present in auxiliary modalities. To bridge this gap, we introduce TimeXL, a multi-modal prediction framework that integrates a prototype-based time series encoder with three collaborating Large Language Models (LLMs) to deliver more accurate predictions and interpretable explanations. First, a multi-modal prototype-based encoder processes both time series and textual inputs to generate preliminary forecasts alongside case-based rationales. These outputs then feed into a prediction LLM, which refines the forecasts by reasoning over the encoder's predictions and explanations. Next, a reflection LLM compares the predicted values against the ground truth, identifying textual inconsistencies or noise. Guided by this feedback, a refinement LLM iteratively enhances text quality and triggers encoder retraining. This closed-loop workflow of prediction, critique (reflection), and refinement continuously boosts the framework's performance and interpretability. Empirical evaluations on four real-world datasets demonstrate that TimeXL achieves up to 8.9% improvement in AUC and produces human-centric, multi-modal explanations, highlighting the power of LLM-driven reasoning for time series prediction.
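The control flow of the closed loop described above can be sketched as follows. This is only an illustration under stated assumptions: every function name here is hypothetical, the three LLM agents and the encoder are replaced by trivial rule-based stand-ins so that the example runs without any model, and the number of rounds is arbitrary.

```python
def run_closed_loop(series, texts, labels, train_encoder,
                    prediction_llm, reflection_llm, refinement_llm,
                    n_rounds=2):
    """Hypothetical loop: train encoder, predict, reflect, refine, retrain."""
    accuracy_per_round = []
    for _ in range(n_rounds):
        # (Re)train the encoder on the current version of the text.
        encoder = train_encoder(series, texts, labels)
        preds = []
        for x, t in zip(series, texts):
            enc_pred, explanation = encoder(x, t)
            # The prediction agent refines the encoder's output.
            preds.append(prediction_llm(t, enc_pred, explanation))
        correct = sum(int(p == y) for p, y in zip(preds, labels))
        accuracy_per_round.append(correct / len(labels))
        # The reflection agent compares predictions with the ground truth.
        feedback = reflection_llm(texts, preds, labels)
        # The refinement agent rewrites the text used in the next round.
        texts = [refinement_llm(t, feedback) for t in texts]
    return texts, accuracy_per_round


# --- Toy stand-ins (assumptions, not the paper's components) ---
def toy_train_encoder(series, texts, labels):
    def encoder(x, t):
        mean = sum(x) / len(x)
        # Predict class 1 when the series mean is positive.
        return int(mean > 0), f"series mean = {mean:.2f}"
    return encoder


def toy_prediction_llm(text, enc_pred, explanation):
    # A misleading "noise" token in the text flips the prediction.
    return 1 - enc_pred if "noise" in text else enc_pred


def toy_reflection_llm(texts, preds, labels):
    # Flag the token "noise" whenever any prediction is wrong.
    wrong = any(p != y for p, y in zip(preds, labels))
    return {"drop": "noise"} if wrong else {}


def toy_refinement_llm(text, feedback):
    drop = feedback.get("drop")
    if drop is None:
        return text
    return " ".join(w for w in text.split() if w != drop)


series = [[1.0, 2.0], [-1.0, -2.0]]
texts = ["rain noise", "dry"]
labels = [1, 0]
final_texts, accuracy = run_closed_loop(
    series, texts, labels, toy_train_encoder,
    toy_prediction_llm, toy_reflection_llm, toy_refinement_llm,
)
```

In this toy run the first round misclassifies the sample whose text contains the misleading token, the reflection step flags it, and the refined text lets the second round classify both samples correctly. In a real system each stand-in would be replaced by a trained encoder or an LLM call.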
