When Does Multimodality Lead to Better Time Series Forecasting?

Xiyuan Zhang, Boran Han, Haoyang Fang, Abdul Fatir Ansari, Shuai Zhang, Danielle C. Maddix, Cuixiong Hu, Andrew Gordon Wilson, Michael W. Mahoney, Hao Wang, Yan Liu, Huzefa Rangwala, George Karypis, Bernie Wang

TL;DR

This work systematically evaluates multimodal time series forecasting (MMTS) across 16 real-world datasets using aligning-based and prompting-based paradigms. It shows that multimodal gains are conditional: they depend on text model capacity, time series backbone strength, alignment design, training data quantity, and how much predictive information the text adds beyond the series itself. The study provides data-agnostic guidelines for when to use MMTS, highlighting that gains are not universal and that careful architectural and data considerations are essential. By delivering a rigorous benchmark and mechanistic insights, the paper clarifies the practical impact of incorporating text into time series forecasting and points to future directions for more effective multimodal approaches.

Abstract

Recently, there has been growing interest in incorporating textual information into foundation models for time series forecasting. However, it remains unclear whether and under what conditions such multimodal integration consistently yields gains. We systematically investigate these questions across a diverse benchmark of 16 forecasting tasks spanning 7 domains, including health, environment, and economics. We evaluate two popular multimodal forecasting paradigms: aligning-based methods, which align time series and text representations; and prompting-based methods, which directly prompt large language models for forecasting. Our findings reveal that the benefits of multimodality are highly condition-dependent. While we confirm reported gains in some settings, these improvements are not universal across datasets or models. To move beyond empirical observations, we disentangle the effects of model architectural properties and data characteristics, drawing data-agnostic insights that generalize across domains. Our findings highlight that on the modeling side, incorporating text information is most helpful given (1) high-capacity text models, (2) comparatively weaker time series models, and (3) appropriate aligning strategies. On the data side, performance gains are more likely when (4) sufficient training data is available and (5) the text offers complementary predictive signal beyond what is already captured from the time series alone. Our study offers a rigorous, quantitative foundation for understanding when multimodality can be expected to aid forecasting tasks, and reveals that its benefits are neither universal nor always aligned with intuition.

Paper Structure

This paper contains 37 sections, 2 equations, 28 figures, 20 tables.

Figures (28)

  • Figure 1: We systematically study when MMTS is effective from modeling and data perspectives, across 16 real-world datasets covering 7 domains. Our findings highlight that MMTS is effective given (1) larger text models, (2) weaker time series models, (3) appropriate aligning strategies, (4) sufficient training data, and (5) complementary text information.
  • Figure 2: Comparison of unimodal (left) and two MMTS methods (middle, right): aligning-based and prompting-based methods. Aligning-based MMTS aligns time series and text representations; and prompting-based MMTS directly prompts LLMs for forecasting.
  • Figure 3: Prompting-based performance vs. MMLU-Pro. The y-axis shows MAE (left) and MSE (right), and the colormap shows the average ranking. Details in Table \ref{tab:main} and Table \ref{tab:llm-expanded}.
  • Figure 4: Aligning-based performance with varying text models. The y-axis shows MAE (left) and MSE (right), and the colormap shows the average ranking. Details in Table \ref{tab:align-model-size} in Appendix \ref{sec:align-result}.
  • Figure 5: The percentage change in MAE and MSE when comparing non-reasoning models to reasoning models across 9 datasets. Positive values indicate non-reasoning models perform better. Details in Table \ref{tab:thinking} in Appendix \ref{sec:prompt-result}.
  • ...and 23 more figures
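
The captions above report MAE and MSE, and Figure 5 compares non-reasoning and reasoning models by percentage change. The snippet below is a minimal sketch of how such numbers could be computed. The function names are hypothetical, and the choice of the non-reasoning error as the baseline (denominator) is an assumption made so that the sign matches the caption (positive means the non-reasoning model has lower error); the paper's exact formula is not given in this section.

```python
def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pct_change(non_reasoning_err, reasoning_err):
    """Percentage change in error going from a non-reasoning to a reasoning model.

    Assumption: the non-reasoning error is the baseline. A positive result
    means the reasoning model has higher error, i.e. the non-reasoning
    model performs better.
    """
    return 100.0 * (reasoning_err - non_reasoning_err) / non_reasoning_err


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the paper.
    y_true = [1.0, 2.0, 3.0]
    non_reasoning_pred = [1.0, 2.0, 4.0]
    reasoning_pred = [1.0, 2.0, 5.0]

    err_non = mae(y_true, non_reasoning_pred)
    err_rea = mae(y_true, reasoning_pred)
    print(f"MAE change: {pct_change(err_non, err_rea):+.1f}%")
```

With this sign convention, a value of zero means both models have the same error, and negative values favor the reasoning model.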