On Creating a Causally Grounded Usable Rating Method for Assessing the Robustness of Foundation Models Supporting Time Series
Kausik Lakkaraju, Rachneet Kaur, Parisa Zehtabi, Sunandita Patra, Siva Likitha Valluru, Zhen Zeng, Biplav Srivastava, Marco Valtorta
TL;DR
The paper tackles robustness assessment for foundation models applied to time-series forecasting by introducing a causally grounded rating framework. It defines a causal model linking input perturbations, a sensitive attribute, and forecast residuals, and operationalizes robustness through a set of input perturbation settings (P0–P3) and two new metrics (APE and PIE%) used alongside established ones. Through zero-shot evaluations of six FMTS (some of them multi-modal) on stock data from six firms, the study finds that multi-modal FMTS and FMTS pre-trained on time series are generally more robust, and that the framework yields actionable guidance for model selection and deployment. A user study with 26 participants indicates that the ratings make it easier to compare systems on robustness and, to a degree, fairness, supporting real-world usability. Overall, the work provides a practical, interpretable, causal approach to rating FMTS robustness in finance-related time-series tasks, with potential applicability to other domains that require reliable forecasting under perturbations.
Abstract
Foundation Models (FMs) have improved time series forecasting in various sectors, such as finance, but their vulnerability to input disturbances can hinder their adoption by stakeholders such as investors and analysts. To address this, we propose a causally grounded rating framework for studying the robustness of Foundation Models for Time Series (FMTS) with respect to input perturbations. We evaluate our approach on the stock price prediction problem, a well-studied problem with easily accessible public data, testing six state-of-the-art (some multi-modal) FMTS across six prominent stocks spanning three industries. The ratings produced by our framework effectively assess the robustness of FMTS and offer actionable insights for model selection and deployment. Within the scope of our study, we find that (1) multi-modal FMTS exhibit better robustness and accuracy than their uni-modal versions, and (2) FMTS pre-trained on the time series forecasting task exhibit better robustness and forecasting accuracy than general-purpose FMTS pre-trained across diverse settings. Further, to validate our framework's usability, we conduct a user study showcasing FMTS prediction errors along with our computed ratings. The study confirmed that our ratings reduced the difficulty users faced in comparing the robustness of different systems.
