Empowering Time Series Analysis with Large-Scale Multimodal Pretraining
Peng Chen, Siyuan Wang, Shiyan Hu, Xingjian Wu, Yang Shu, Zhongwen Rao, Meng Wang, Yijie Li, Bin Yang, Chenjuan Guo
TL;DR
This work addresses the lack of complementary modalities in unimodal time series models by introducing a multimodal pretraining framework for time series. It presents MM-TS, the first large-scale multimodal time series dataset, which spans six domains and pairs time series with text and images, and HORAI, a frequency-enhanced multimodal foundation model built on a Frequency-enhanced Cross-Modality Encoder and a Time-Frequency Decoder. Pretrained on MM-TS, HORAI achieves state-of-the-art zero-shot performance in time series forecasting and anomaly detection across diverse domains, demonstrating strong cross-modal and cross-domain generalization. The results underscore the practical potential of combining endogenous and exogenous modalities for robust, scalable time series understanding.
Abstract
Existing time series foundation models rely primarily on large-scale unimodal pretraining and therefore lack complementary modalities that could enhance time series understanding. Building multimodal foundation models is a natural next step, but it faces two key challenges: 1) the lack of a unified multimodal pretraining paradigm and of large-scale multimodal corpora for time series analysis; 2) the difficulty of effectively integrating heterogeneous modalities while enhancing model generalization. To address these challenges, we take an early step toward multimodal foundation models for time series analysis. We first propose a multimodal pretraining paradigm that leverages time series with endogenous modalities (derived images and text) and exogenous knowledge (real-world news), providing a comprehensive multi-view perspective for time series analysis. To support this, we develop an automated data construction pipeline to curate MM-TS, the first large-scale multimodal time series dataset spanning six domains, with up to one billion points. We then propose HORAI, a frequency-enhanced multimodal foundation model. It integrates two core components, the Frequency-enhanced Cross-Modality Encoder and the Time-Frequency Decoder, which are designed to fuse multimodal features effectively and to enhance model generalization across modalities and domains. After pretraining on MM-TS, HORAI achieves state-of-the-art zero-shot performance on time series forecasting and anomaly detection tasks, demonstrating strong generalization.
