Conversational Time Series Foundation Models: Towards Explainable and Effective Forecasting

Defu Cao, Michael Gee, Jinbo Liu, Hengxuan Wang, Wei Yang, Rui Wang, Yan Liu

TL;DR

The paper tackles the lack of a universally superior time-series model by proposing TSOrchestra, an approach that uses an LLM as an intelligent judge to orchestrate an ensemble of specialized forecasting models. Central to the method is an R1-style fine-tuning pipeline that aligns the LLM's weight-redistribution reasoning with SHAP-based causal explanations, enabling forward-looking, interpretable decisions. The framework formalizes ensemble advantages under temporal incompatibility, employs SLSQP for ensemble weight optimization, and leverages GRPO-based RL to improve decision-making while keeping explanations faithful to model contributions. Empirically, TSOrchestra achieves state-of-the-art CRPS and MASE on GIFT-Eval across 97 settings and provides an interpretable rationale for its weight configurations, demonstrating practical viability for production forecasting with transparent reasoning.

Abstract

The proliferation of time series foundation models has created a landscape where no single method achieves consistent superiority, framing the central challenge not as finding the best model, but as orchestrating an optimal ensemble with interpretability. While Large Language Models (LLMs) offer powerful reasoning capabilities, their direct application to time series forecasting has proven ineffective. We address this gap by repositioning the LLM as an intelligent judge that evaluates, explains, and strategically coordinates an ensemble of foundation models. To overcome the LLM's inherent lack of domain-specific knowledge on time series, we introduce an R1-style finetuning process, guided by SHAP-based faithfulness scores, which teaches the model to interpret ensemble weights as meaningful causal statements about temporal dynamics. The trained agent then engages in iterative, multi-turn conversations to perform forward-looking assessments, provide causally-grounded explanations for its weighting decisions, and adaptively refine the optimization strategy. Validated on the GIFT-Eval benchmark on 23 datasets across 97 settings, our approach significantly outperforms leading time series foundation models on both CRPS and MASE metrics, establishing new state-of-the-art results.

Paper Structure

This paper contains 48 sections, 2 theorems, 37 equations, 10 figures, 2 tables, and 1 algorithm.

Key Result

Theorem 3.1

For time series with temporal incompatibility $I_T(\mathcal{D}) > 0$, given the model diversity $M$, the optimal ensemble achieves: where $\Omega(I_T, M) = \frac{I_T \cdot \log M}{1 + \exp(-\kappa \cdot I_T)}$ quantifies the ensemble advantage as a function of incompatibility.
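
To get a feel for how the advantage term $\Omega(I_T, M)$ behaves, the formula above can be evaluated directly. The sketch below is illustrative only: the function name `ensemble_advantage` is hypothetical, the logarithm is assumed to be natural, and the default `kappa = 1.0` is an arbitrary placeholder because the summary does not state a value for $\kappa$.

```python
import math


def ensemble_advantage(i_t: float, m: int, kappa: float = 1.0) -> float:
    """Evaluate Omega(I_T, M) = I_T * log(M) / (1 + exp(-kappa * I_T)).

    i_t   : temporal incompatibility I_T (the theorem assumes I_T > 0)
    m     : model diversity M (assumed here to be a count >= 1)
    kappa : sharpness constant; 1.0 is an assumed placeholder value
    """
    if i_t <= 0:
        raise ValueError("the theorem is stated for I_T > 0")
    if m < 1:
        raise ValueError("M must be at least 1")
    # Natural log is an assumption; the source does not name the base.
    return i_t * math.log(m) / (1.0 + math.exp(-kappa * i_t))


# Tabulate the term for a few incompatibility levels and ensemble sizes.
for i_t in (0.1, 0.5, 1.0):
    row = [round(ensemble_advantage(i_t, m), 4) for m in (1, 2, 4)]
    print(f"I_T={i_t}: M=1,2,4 -> {row}")
```

Under these assumptions the term vanishes for a single model ($\log 1 = 0$) and grows with both $I_T$ and $M$, which matches the reading that more diverse ensembles help more when incompatibility is high.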

Figures (10)

  • Figure 1: Model Architecture. The top stage shows an LLM agent guiding SLSQP ensemble optimization through iterative reasoning. The bottom stage details the agent's R1-style finetuning, where a SHAP-aligned Faithfulness Score rewards causally-grounded explanations during GRPO training.
  • Figure 2: Leaderboard comparison on the GIFT-Eval benchmark. Categories include pre-trained foundation models, deep learning baselines, agent-based approaches, zero-shot methods, and classical statistical models.
  • Figure 3: Ablation study and design space exploration. Left: Impact of ensemble size on performance across different time horizons, comparing 2-model (Toto, Sundial), 3-model (+ Moirai-2), and 4-model (+ TabPFN-TS) configurations. Right: Design choices evaluation including optimization methods (L-BFGS-B), cross-validation strategies (Motifs), LLM backends (Claude-3.5-Sonnet, GPT-4o), and additional metrics (CRPS). Lower rank indicates better performance.
  • Figure 4: LLM Vote Ensemble vs. Random Metric Ensemble. The plots compare $\Delta$MASE and $\Delta$CRPS. Green bars indicate domains where the LLM vote ensemble outperforms the random metric baseline. Notable improvements are observed in Healthcare (e.g., covid_deaths with $\Delta$MASE of -1.86) and Sales (e.g., car_parts with $\Delta$CRPS of -0.044). The dominance of the agent in these complex domains validates the necessity of reasoning-guided optimization.
  • Figure 5: Stability of LLM Vote Ensemble. The histograms show the variance of MASE and CRPS across 6 independent runs for 97 datasets.
  • ...and 5 more figures
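
Figure 1 describes SLSQP ensemble optimization guided by the LLM agent. A minimal sketch of the weight-fitting step alone, without the agent, might look like the following. It assumes a mean-squared-error objective on a validation window (the paper reports CRPS and MASE, so the actual objective may differ), weights constrained to be non-negative and to sum to one, and synthetic data standing in for real model forecasts. The function name `fit_ensemble_weights` is hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def fit_ensemble_weights(preds: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Fit convex-combination weights for an ensemble with SLSQP.

    preds : array of shape (n_models, n_steps), one forecast per model
    y     : array of shape (n_steps,), the observed values
    """
    preds = np.asarray(preds, dtype=float)
    y = np.asarray(y, dtype=float)
    n_models = preds.shape[0]

    def mse(w: np.ndarray) -> float:
        # Assumed objective: squared error of the weighted forecast.
        return float(np.mean((w @ preds - y) ** 2))

    result = minimize(
        mse,
        x0=np.full(n_models, 1.0 / n_models),  # start from uniform weights
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_models,        # each weight in [0, 1]
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return result.x


# Synthetic stand-ins for three forecasters of differing quality.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 6.0, 200)
y = np.sin(t)
preds = np.stack([
    y + 0.05 * rng.standard_normal(t.size),        # accurate model
    y + 0.50 * rng.standard_normal(t.size),        # noisy model
    y + 1.00 + 0.10 * rng.standard_normal(t.size)  # biased model
])

weights = fit_ensemble_weights(preds, y)
print("weights:", np.round(weights, 3))
```

In the framework described here, the agent would sit around a step like this one, adjusting the optimization setup across turns; that loop is not reproduced in the sketch.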

Theorems & Definitions (2)

  • Theorem 3.1: Ensemble Superiority in Time Series Forecasting
  • Theorem E.1: Ensemble Superiority Under Temporal Incompatibility