On Creating a Causally Grounded Usable Rating Method for Assessing the Robustness of Foundation Models Supporting Time Series

Kausik Lakkaraju, Rachneet Kaur, Parisa Zehtabi, Sunandita Patra, Siva Likitha Valluru, Zhen Zeng, Biplav Srivastava, Marco Valtorta

TL;DR

The paper tackles robustness assessment for foundation models applied to time-series forecasting by introducing a causally grounded rating framework. It defines a causal model linking input perturbations, a sensitive attribute, and forecast residuals, and operationalizes robustness with a set of input perturbation settings (labeled P0–P3) plus two new metrics (APE and PIE%) alongside established ones. Through zero-shot evaluations of six FMTS (including multi-modal variants) on stock data across six firms, the study finds that multi-modal FMTS and FMTS pre-trained on time series generally offer greater robustness, and that the framework yields actionable guidance for model selection and deployment. A user study with 26 participants indicates that the ratings make it easier to compare systems on robustness and, to a degree, fairness, supporting real-world usability. Overall, the work provides a practical, interpretable, and causal approach to rating FMTS robustness in finance-related time-series tasks, with potential applicability to other domains requiring reliable forecasting under perturbations.

Abstract

Foundation Models (FMs) have improved time series forecasting in various sectors, such as finance, but their vulnerability to input disturbances can hinder their adoption by stakeholders, such as investors and analysts. To address this, we propose a causally grounded rating framework to study the robustness of Foundational Models for Time Series (FMTS) with respect to input perturbations. We evaluate our approach on the stock price prediction problem, a well-studied problem with easily accessible public data, using six state-of-the-art (some multi-modal) FMTS across six prominent stocks spanning three industries. The ratings proposed by our framework effectively assess the robustness of FMTS and also offer actionable insights for model selection and deployment. Within the scope of our study, we find that (1) multi-modal FMTS exhibit better robustness and accuracy than their uni-modal versions and (2) FMTS pre-trained on the time series forecasting task exhibit better robustness and forecasting accuracy than general-purpose FMTS pre-trained across diverse settings. Further, to validate our framework's usability, we conduct a user study that presents FMTS prediction errors along with our computed ratings. The study confirmed that our ratings made it easier for users to compare the robustness of different systems.
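The general idea of rating robustness under input perturbations can be illustrated with a minimal sketch. It is not the paper's method: the forecaster (`naive_forecast`), the perturbation (`perturb_drop_to_zero`), the raw score (shift in mean absolute residual), and the rank-based binning in `assign_ratings` are all hypothetical stand-ins chosen for illustration, and none of the paper's metrics or causal analysis is reproduced here.

```python
from statistics import mean


def naive_forecast(history, horizon):
    """Hypothetical stand-in forecaster: repeat the mean of the last 5 points."""
    window = history[-5:]
    level = sum(window) / len(window)
    return [level] * horizon


def perturb_drop_to_zero(series, every=10):
    """Hypothetical perturbation: set every `every`-th value to zero."""
    return [0.0 if i % every == every - 1 else x for i, x in enumerate(series)]


def mean_abs_residual(forecast, actual):
    """Mean absolute difference between forecast and ground truth."""
    return mean(abs(f - a) for f, a in zip(forecast, actual))


def residual_shift(forecaster, history, actual, perturbation):
    """Raw score (assumption): how much the mean absolute residual grows
    when the input history is perturbed. Lower means more robust."""
    horizon = len(actual)
    clean = mean_abs_residual(forecaster(history, horizon), actual)
    perturbed = mean_abs_residual(
        forecaster(perturbation(history), horizon), actual
    )
    return perturbed - clean


def assign_ratings(raw_scores, levels=3):
    """Rank systems by raw score and bin the ranks into `levels` ratings.
    Rating 1 is the most robust group (assumed convention)."""
    ordered = sorted(raw_scores, key=raw_scores.get)
    n = len(ordered)
    return {name: 1 + (rank * levels) // n for rank, name in enumerate(ordered)}


# Toy data: a linear "price" series and the next three true values.
history = [100.0 + i for i in range(30)]
actual = [130.0, 131.0, 132.0]

shift = residual_shift(naive_forecast, history, actual, perturb_drop_to_zero)
ratings = assign_ratings({"a": 0.1, "b": 5.0, "c": shift}, levels=3)
```

In this toy setup the perturbation zeroes the most recent input point, so the stand-in forecaster's error grows sharply and system `"c"` lands in the least robust group. A real evaluation would replace the stand-ins with actual FMTS and the perturbations and metrics defined in the paper.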

Paper Structure

This paper contains 28 sections, 5 equations, 19 figures, 6 tables, 4 algorithms.

Figures (19)

  • Figure 1: Causal model $\mathcal{M}$ for FMTS. The validity of link '1' depends on the data distribution ($P|Z$), while the validity of links '2' and '3' is tested in our experiments.
  • Figure 2: Variants of the causal diagram in Figure 1 used to answer different research questions (RQs).
  • Figure 3: (a) Black arrows denote the unperturbed paths and red arrows indicate the perturbed paths. Dashed lines show the multi-modal path. The perturbed parts of the plots are highlighted in red. (b) Workflow for performing statistical and causal analysis to compute raw scores and assign final ratings to the test systems.
  • Figure 4: Studying each metric with respect to the impact of company and industry as confounders for all models and all perturbations. Plotted in double logarithmic scale; lower values indicate better robustness. Ratings generated by our method (with $L=3$) are shown at the top of each plot. The complete final order (with ratings) is shown in the ratings table in the experiments appendix.
  • Figure 5: Effect of the modalities for $S_g$ (left) and $S_p$ (right).
  • ...and 14 more figures