TimeXL: Explainable Multi-modal Time Series Prediction with LLM-in-the-Loop

Yushan Jiang, Wenchao Yu, Geon Lee, Dongjin Song, Kijung Shin, Wei Cheng, Yanchi Liu, Haifeng Chen

TL;DR

TimeXL tackles explainable multi-modal time-series forecasting by integrating a prototype-based encoder for time series and text with an LLM-in-the-loop consisting of prediction, reflection, and refinement agents. The encoder generates case-based explanations and predictions, which the prediction LLM refines using those explanations; a reflection LLM identifies textual weaknesses and a refinement LLM updates the context, triggering encoder retraining in a closed loop. The final prediction blends encoder-based probabilities and LLM-derived forecasts via $\boldsymbol{\hat{y}} = \alpha \boldsymbol{\hat{y}}_{\text{enc}} + (1-\alpha) \boldsymbol{\hat{y}}_{\text{LLM}}$, and the process yields improved performance and faithful, human-centric explanations. Experiments on four real-world datasets show TimeXL achieving up to 8.9% AUROC improvements and delivering explainable, multi-modal reasoning that enhances trust and decision-making in complex settings.
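The fusion rule above is a convex combination of two prediction vectors, and a small numeric sketch may make it concrete. The function name `fuse_predictions`, the helper `one_hot`, the choice to encode the LLM's answer as a one-hot vector, and the value of `alpha` are all assumptions made for illustration; the summary does not say how the LLM output is converted to a vector or how $\alpha$ is chosen.

```python
import numpy as np


def fuse_predictions(y_enc, y_llm, alpha):
    """Return alpha * y_enc + (1 - alpha) * y_llm (hypothetical helper)."""
    y_enc = np.asarray(y_enc, dtype=float)
    y_llm = np.asarray(y_llm, dtype=float)
    if y_enc.shape != y_llm.shape:
        raise ValueError("encoder and LLM predictions must have the same shape")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * y_enc + (1.0 - alpha) * y_llm


def one_hot(label, num_classes):
    """Encode a discrete label as a one-hot vector (an assumption of this sketch)."""
    vec = np.zeros(num_classes)
    vec[label] = 1.0
    return vec


# Toy two-class case: the encoder leans toward class 0, the LLM picks class 1.
y_enc = np.array([0.6, 0.4])      # encoder class probabilities
y_llm = one_hot(1, num_classes=2)  # LLM decision as a one-hot vector
y_hat = fuse_predictions(y_enc, y_llm, alpha=0.7)
predicted_class = int(np.argmax(y_hat))
```

With these toy numbers the fused vector is 0.7 * [0.6, 0.4] + 0.3 * [0, 1] = [0.42, 0.58], so the LLM's vote tips the final decision to class 1 even though the encoder alone preferred class 0.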

Abstract

Time series analysis provides essential insights into real-world system dynamics and informs downstream decision-making, yet most existing methods overlook the rich contextual signals present in auxiliary modalities. To bridge this gap, we introduce TimeXL, a multi-modal prediction framework that integrates a prototype-based time series encoder with three collaborating Large Language Models (LLMs) to deliver more accurate predictions and interpretable explanations. First, a multi-modal prototype-based encoder processes both time series and textual inputs to generate preliminary forecasts alongside case-based rationales. These outputs then feed into a prediction LLM, which refines the forecasts by reasoning over the encoder's predictions and explanations. Next, a reflection LLM compares the predicted values against the ground truth, identifying textual inconsistencies or noise. Guided by this feedback, a refinement LLM iteratively enhances text quality and triggers encoder retraining. This closed-loop workflow of prediction, critique (reflection), and refinement continuously boosts the framework's performance and interpretability. Empirical evaluations on four real-world datasets demonstrate that TimeXL achieves up to 8.9% improvement in AUC and produces human-centric, multi-modal explanations, highlighting the power of LLM-driven reasoning for time series prediction.
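The control flow of the closed loop described in the abstract can be sketched as below. This is a structural illustration only, not the authors' implementation: `timexl_loop` and every `toy_*` function are hypothetical stand-ins (the real encoder is a prototype-based network and the three agents are LLM calls), and the toy rule that a `NOISE` token misleads the predictor exists purely to make the loop's effect visible.

```python
def timexl_loop(train_encoder, predict_llm, reflect_llm, refine_llm,
                series, texts, labels, num_iters=2):
    """Run (re)train -> predict -> reflect -> refine for a fixed number of rounds."""
    accuracy_history = []
    for _ in range(num_iters):
        # (Re)train the encoder on the current version of the text.
        encoder = train_encoder(series, texts, labels)
        preds = []
        for x, text in zip(series, texts):
            enc_pred, rationale = encoder(x, text)
            # Prediction agent: refine the forecast using the encoder's output.
            preds.append(predict_llm(text, enc_pred, rationale))
        correct = sum(p == y for p, y in zip(preds, labels))
        accuracy_history.append(correct / len(labels))
        # Reflection agent: compare predictions with ground truth.
        feedback = reflect_llm(texts, preds, labels)
        # Refinement agent: rewrite the text guided by the feedback.
        texts = [refine_llm(text, feedback) for text in texts]
    return texts, accuracy_history


# --- Toy stand-ins (assumptions for illustration only) ---
def toy_train_encoder(series, texts, labels):
    def encoder(x, text):
        mean = sum(x) / len(x)
        return int(mean > 0), f"series mean = {mean:.2f}"
    return encoder


def toy_predict_llm(text, enc_pred, rationale):
    # Pretend that distracting text flips the otherwise correct decision.
    return 1 - enc_pred if "NOISE" in text else enc_pred


def toy_reflect_llm(texts, preds, labels):
    wrong = any(p != y for p, y in zip(preds, labels))
    return "drop NOISE tokens" if wrong else ""


def toy_refine_llm(text, feedback):
    return text.replace("NOISE", "").strip() if feedback else text


series = [[1.0, 2.0], [-1.0, -2.0]]
texts = ["rain likely NOISE", "clear skies"]
labels = [1, 0]
refined_texts, accuracy_history = timexl_loop(
    toy_train_encoder, toy_predict_llm, toy_reflect_llm, toy_refine_llm,
    series, texts, labels, num_iters=2,
)
```

In this toy run the first round misclassifies the sample whose text contains `NOISE`, the reflection step flags the error, the refinement step cleans the text, and the second round classifies both samples correctly. Passing the agents as plain callables keeps the loop independent of any particular LLM client.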

Paper Structure

This paper contains 36 sections, 7 equations, 26 figures, 11 tables, 1 algorithm.

Figures (26)

  • Figure 1: An overview of the TimeXL workflow. A prototype-based explainable encoder first produces predictions and case-based rationales for both time series and text. The prediction LLM refines forecasts based on these rationales (Step 1). A reflection LLM then critiques the output against ground truth (Step 2), providing feedback to detect textual noise (Step 3). Finally, a refinement LLM updates the text accordingly, triggering encoder retraining for improved accuracy and explanations (Step 4). Note that the predictions from the encoder and the LLM are also fused to enhance overall accuracy.
  • Figure 2: The training protocol of TimeXL (left), and multi-modal prototype-based encoder (right).
  • Figure 3: Key time series prototypes and text prototypes learned on the Weather dataset. Each row in the figure represents a time series prototype with different channels.
  • Figure 4: Multi-modal case-based reasoning example on the Weather dataset. The left part illustrates the reasoning process for both the original and refined text in TimeXL, with matched prototype-input pairs highlighted in the same color along with their similarity scores. The right part presents the time series reasoning in TimeXL, where matched prototypes are overlaid on the time series.
  • Figure 5: The text quality and TimeXL performance over iterations.
  • ...and 21 more figures