Aurora: Towards Universal Generative Multimodal Time Series Forecasting

Xingjian Wu, Jianxin Jin, Wanghui Qiu, Peng Chen, Yang Shu, Bin Yang, Chenjuan Guo

TL;DR

Aurora tackles cross-domain time series forecasting with a multimodal foundation model pretrained on a Cross-Domain Multimodal Time Series Corpus, which enables zero-shot cross-domain inference. It combines a Multimodal Tokenization/Distillation/Alignment encoding stage with a Condition-Decoder and a Prototype-Guided Flow Matching decoder to produce generative probabilistic forecasts that leverage domain knowledge from text and endogenous images. The approach achieves state-of-the-art results across unimodal and multimodal benchmarks (TimeMMD, TSFM-Bench, ProbTS), with ablations validating the necessity of each component and analyses confirming scalable inference. By integrating multimodal knowledge, the work extends time series foundation models to settings where domain context determines the forecast.

Abstract

Cross-domain generalization is crucial in time series forecasting because similar historical information may lead to distinct future trends owing to domain-specific characteristics. Recent works focus on building either unimodal time series foundation models or end-to-end multimodal supervised models. Since domain-specific knowledge is often contained in modalities such as text, the former do not explicitly exploit it, which limits their performance. The latter are tailored to end-to-end scenarios and do not support zero-shot inference in cross-domain scenarios. In this work, we introduce Aurora, a Multimodal Time Series Foundation Model that supports multimodal inputs and zero-shot inference. Pretrained on a Cross-Domain Multimodal Time Series Corpus, Aurora can adaptively extract and focus on key domain knowledge contained in the corresponding text or image modalities, and thus possesses strong cross-domain generalization capability. Through tokenization, encoding, and distillation, Aurora extracts multimodal domain knowledge as guidance and then uses Modality-Guided Multi-head Self-Attention to inject it into the modeling of temporal representations. In the decoding phase, the multimodal representations are used to generate the conditions and prototypes of future tokens, which feed a novel Prototype-Guided Flow Matching for generative probabilistic forecasting. Comprehensive experiments on well-recognized benchmarks, including TimeMMD, TSFM-Bench, and ProbTS, demonstrate the consistent state-of-the-art performance of Aurora in both unimodal and multimodal scenarios.
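
To make the decoding idea more concrete, the following is a minimal sketch of flow matching in which sampling starts from a prototype rather than from pure Gaussian noise. It is not the paper's implementation: it assumes the common linear-interpolation (rectified-flow style) path, replaces the learned network with an oracle velocity field, and uses hypothetical names (`prototype`, `oracle_velocity`, `euler_sample`) and toy data chosen only for illustration.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point on the straight path from start x0 (t=0) to target x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Regression target for the velocity field on the straight path."""
    return x1 - x0


def euler_sample(velocity_fn, x0, n_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


# Toy "future" segment standing in for the ground-truth future tokens.
future = np.sin(np.linspace(0.0, np.pi, 8))

# Assumption: the prototype is a rough guess of the future shape.
prototype = future + 0.1

rng = np.random.default_rng(0)
noise_start = rng.standard_normal(8)                      # classic start point
proto_start = prototype + 0.05 * rng.standard_normal(8)   # prototype-guided start


def oracle_velocity(x, t):
    """Stand-in for a trained network: exact velocity toward `future`."""
    return (future - x) / (1.0 - t)


sample = euler_sample(oracle_velocity, proto_start)

# Distance the flow has to cover from each start point.
dist_from_noise = np.linalg.norm(future - noise_start)
dist_from_proto = np.linalg.norm(future - proto_start)
```

In this toy setting the prototype-based start lies much closer to the target than a Gaussian draw, so the flow has a shorter path to transport. In a real model the velocity field would be a trained network conditioned on the encoded inputs, and repeated sampling from perturbed prototypes would yield the probabilistic forecast.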

Paper Structure

This paper contains 33 sections, 14 equations, 7 figures, 11 tables, 1 algorithm.

Figures (7)

  • Figure 1: Aurora is pretrained on a cross-domain multimodal time series corpus and supports both text and image information to enhance zero-shot time series forecasting.
  • Figure 2: Overview of Aurora.
  • Figure 3: Prototype-Guided Flow Matching. The starting point is set to a prototype instead of random Gaussian noise, which provides intuitive guidance for the generation process.
  • Figure 4: Evaluation summary of Aurora.
  • Figure 5: Sampled Predictions.
  • ...and 2 more figures