Empowering Time Series Analysis with Large-Scale Multimodal Pretraining

Peng Chen, Siyuan Wang, Shiyan Hu, Xingjian Wu, Yang Shu, Zhongwen Rao, Meng Wang, Yijie Li, Bin Yang, Chenjuan Guo

TL;DR

This work tackles the limitation of unimodal time series models by introducing a multimodal pretraining framework for time series. It presents MM-TS, the first large-scale multimodal time series dataset spanning six domains with text and images, and HORAI, a frequency-enhanced multimodal foundation model featuring a Frequency-enhanced Cross-Modality Encoder and a Time-Frequency Decoder. Pretrained on MM-TS, HORAI achieves state-of-the-art zero-shot performance in time series forecasting and anomaly detection across diverse domains, demonstrating strong cross-modal and cross-domain generalization. The approach underscores the practical potential of leveraging endogenous and exogenous modalities for robust, scalable time series understanding.

Abstract

While existing time series foundation models primarily rely on large-scale unimodal pretraining, they lack complementary modalities to enhance time series understanding. Building multimodal foundation models is a natural next step, but it faces two key challenges: 1) the lack of a unified multimodal pretraining paradigm and of large-scale multimodal corpora for time series analysis; and 2) the difficulty of effectively integrating heterogeneous modalities while enhancing model generalization. To address these challenges, we take an early step toward multimodal foundation models for time series analysis. We first propose a multimodal pretraining paradigm that leverages time series with endogenous modalities (derived images and text) and exogenous knowledge (real-world news), providing a comprehensive multi-view perspective for time series analysis. To support this, we develop an automated data construction pipeline to curate MM-TS, the first large-scale multimodal time series dataset spanning six domains, with up to one billion points. We then propose HORAI, a frequency-enhanced multimodal foundation model. It integrates two core components, the Frequency-enhanced Cross-Modality Encoder and the Time-Frequency Decoder, which are designed to effectively fuse multimodal features and enhance model generalization across modalities and domains. After pretraining on MM-TS, HORAI achieves state-of-the-art zero-shot performance on time series forecasting and anomaly detection tasks, demonstrating strong generalization.
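To make the idea of an endogenous text modality concrete, here is a minimal sketch of how a short description could be derived from a series itself, using a linear trend fit and the dominant FFT period. It is an illustration only and not the MM-TS construction pipeline or any part of HORAI: the helper name `describe_series`, its parameters, the wording of the generated sentence, and the synthetic input are all assumptions.

```python
import numpy as np


def describe_series(values, sampling_interval=1.0):
    """Summarize trend direction and dominant period as text.

    Hypothetical helper for illustration; not part of MM-TS or HORAI.
    """
    x = np.asarray(values, dtype=float)
    n = x.size
    t = np.arange(n)

    # Least-squares linear fit; the sign of the slope gives the trend direction.
    slope, intercept = np.polyfit(t, x, deg=1)
    detrended = x - (slope * t + intercept)

    # Amplitude spectrum of the detrended signal (real-input FFT).
    amplitude = np.abs(np.fft.rfft(detrended))
    freqs = np.fft.rfftfreq(n, d=sampling_interval)

    # Skip the zero-frequency bin when searching for the dominant cycle.
    k = 1 + int(np.argmax(amplitude[1:]))
    period = 1.0 / freqs[k]

    direction = "upward" if slope > 0 else "downward" if slope < 0 else "flat"
    text = (
        f"The series shows a {direction} trend with a dominant period "
        f"of about {period:.1f} time units."
    )
    return {"slope": float(slope), "period": float(period), "text": text}


# Synthetic example (assumed data): a slow upward drift plus a 24-step cycle.
t = np.arange(240)
series = 0.05 * t + np.sin(2 * np.pi * t / 24)
summary = describe_series(series)
print(summary["text"])
```

A description like this is computed purely from the series, which is what distinguishes it from exogenous knowledge such as retrieved news. A real pipeline would presumably use richer pattern analysis and quality checks than this two-statistic summary.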

Paper Structure

This paper contains 40 sections, 7 equations, 9 figures, and 11 tables.

Figures (9)

  • Figure 1: Left: The large-scale multimodal time series dataset MM-TS is characterized by its coverage of various modalities, heterogeneous domains, and diverse temporal patterns. Right: The multimodal foundation model HORAI is pretrained on the MM-TS dataset and evaluated on downstream scenarios and tasks.
  • Figure 2: The automated data construction pipeline for multimodal text. It comprises two main stages: 1) Contextual Synthesis, involving endogenous pattern analysis and exogenous news retrieval; and 2) Quality Alignment, which ensures consistency between the synthesized texts and filters low-quality data via an LLM judger.
  • Figure 3: The framework of the proposed HORAI, which consists of a Frequency-enhanced Cross-Modality Encoder (gray region) and a Time-Frequency Decoder (blue region).
  • Figure 4: Ablation study on the Social Good dataset and the Energy dataset.
  • Figure 5: Fine-tuning HORAI with different data percentages on the Environment dataset.
  • ...and 4 more figures