TiMi: Empower Time Series Transformers with Multimodal Mixture of Experts

Jiafeng Lin, Yuxuan Wang, Huakun Luo, Zhongyi Pei, Jianmin Wang

TL;DR

This paper introduces a Multimodal Mixture-of-Experts (MMoE) module as a lightweight plug-in to empower Transformer-based time series models for multimodal forecasting, eliminating the need for explicit representation-level alignment.

Abstract

Multimodal time series forecasting has garnered significant attention for its potential to provide more accurate predictions than traditional single-modality models by leveraging rich information inherent in other modalities. However, due to fundamental challenges in modality alignment, existing methods often struggle to effectively incorporate multimodal data into predictions, particularly textual information that has a causal influence on time series fluctuations, such as emergency reports and policy announcements. In this paper, we reflect on the role of textual information in numerical forecasting and propose Time series transformers with Multimodal Mixture-of-Experts, TiMi, to unleash the causal reasoning capabilities of LLMs. Concretely, TiMi utilizes LLMs to generate inferences on future developments, which serve as guidance for time series forecasting. To seamlessly integrate both exogenous factors and time series into predictions, we introduce a Multimodal Mixture-of-Experts (MMoE) module as a lightweight plug-in to empower Transformer-based time series models for multimodal forecasting, eliminating the need for explicit representation-level alignment. Experimentally, our proposed TiMi demonstrates consistent state-of-the-art performance on sixteen real-world multimodal forecasting benchmarks, outperforming advanced baselines while offering both strong adaptability and interpretability.
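The abstract does not specify the internals of the MMoE module, so the following is only a minimal sketch of one way a text-conditioned mixture-of-experts layer could look. It is not the authors' architecture. The class name `ToyMultimodalMoE`, the mean-pooling step, the top-k softmax router, and all dimensions are assumptions made for illustration. The sketch shows a single idea: the text embedding influences only the routing weights, while the experts operate purely on time series tokens, so the text is never projected into the series representation space.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights; a stand-in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyMultimodalMoE:
    """Hypothetical layer (an assumption, not the paper's design):
    experts transform series tokens, the router also sees the text vector."""

    def __init__(self, d_model, d_text, n_experts, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.experts = [rand_matrix(d_model, d_model, rng) for _ in range(n_experts)]
        # The router reads the pooled series summary concatenated with the text vector.
        self.router = rand_matrix(n_experts, d_model + d_text, rng)

    def gate(self, series_tokens, text_vec):
        """Return {expert_index: weight} for the top-k experts."""
        d = len(series_tokens[0])
        n = len(series_tokens)
        # Mean-pool the tokens to get a global view of the historical input.
        pooled = [sum(tok[i] for tok in series_tokens) / n for i in range(d)]
        logits = matvec(self.router, pooled + text_vec)
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[: self.top_k]
        weights = softmax([logits[i] for i in top])
        return dict(zip(top, weights))

    def forward(self, series_tokens, text_vec):
        """Mix expert outputs per token and add a residual connection."""
        gates = self.gate(series_tokens, text_vec)
        out = []
        for tok in series_tokens:
            mixed = [0.0] * len(tok)
            for idx, w in gates.items():
                y = matvec(self.experts[idx], tok)
                mixed = [m + w * v for m, v in zip(mixed, y)]
            out.append([t + m for t, m in zip(tok, mixed)])
        return out


# Usage with made-up inputs: two series tokens and one text embedding.
layer = ToyMultimodalMoE(d_model=4, d_text=3, n_experts=4, top_k=2)
tokens = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.5]]
text_vec = [1.0, 0.0, -1.0]
gates = layer.gate(tokens, text_vec)
out = layer.forward(tokens, text_vec)
```

In this toy version the output keeps the shape of the input tokens, which is what would let such a layer be dropped into an existing Transformer block as a plug-in. Whether TiMi routes per token or per sequence, and how its text representation is produced, is not stated in the abstract.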

Paper Structure (37 sections, 8 equations, 11 figures, 11 tables)

Figures (11)

  • Figure 1: Two types of multimodal time series data exist in real-world forecasting scenarios. Compared to series-metadata data, which is semantically aligned like vision-language data, series-textual data lacks this direct alignment but often offers richer and more informative insights for prediction.
  • Figure 2: Three categories of multimodal time series forecasting models distinguished through fusion.
  • Figure 3: Overall design of TiMi, a time series-centric Transformer model. Textual content is encoded by LLMs to infer future trends of predictions. The MMoE module integrates a global view of both historical input and causal knowledge derived from text, thereby enhancing the time series modeling.
  • Figure 4: Irregular multimodal time series forecasting results on Time-IMM datasets. The metric for comparison is MSE. We compared four methods using PatchTST as the temporal backbone, among which the baseline results of IMM-TSF and PatchTST were derived from Time-IMM [chang2025time].
  • Figure 5: Multimodal time series forecasting results of replacing the LLM backbone. The time series backbone is fixed as PatchTST. The result is the average performance across four dataset-specific prediction horizons. For detailed results, please refer to the Appendix.
  • ...and 6 more figures