Aurora: Towards Universal Generative Multimodal Time Series Forecasting
Xingjian Wu, Jianxin Jin, Wanghui Qiu, Peng Chen, Yang Shu, Bin Yang, Chenjuan Guo
TL;DR
Aurora is a multimodal time series foundation model pretrained on a Cross-Domain Multimodal Time Series Corpus, which enables zero-shot inference across domains. It combines a Multimodal Tokenization/Distillation/Alignment encoding stage with a Condition-Decoder and a Prototype-Guided Flow Matching decoder to produce generative probabilistic forecasts that draw on domain knowledge from text and endogenous images. The approach achieves state-of-the-art results on unimodal and multimodal benchmarks (TimeMMD, TSFM-Bench, ProbTS), with ablations validating the necessity of each component and analyses confirming scalable inference. The work extends time series foundation models to cross-domain forecasting by integrating multimodal knowledge.
Abstract
Cross-domain generalization is crucial in time series forecasting because similar historical information may lead to distinct future trends due to domain-specific characteristics. Recent works focus on building either unimodal time series foundation models or end-to-end multimodal supervised models. Domain-specific knowledge is often contained in modalities such as text, yet the former do not explicitly utilize it, which hinders their performance. The latter are tailored to end-to-end scenarios and do not support zero-shot inference in cross-domain scenarios. In this work, we introduce Aurora, a Multimodal Time Series Foundation Model that supports multimodal inputs and zero-shot inference. Pretrained on a Cross-domain Multimodal Time Series Corpus, Aurora can adaptively extract and focus on key domain knowledge contained in the corresponding text or image modalities, and thus possesses strong cross-domain generalization capability. Through tokenization, encoding, and distillation, Aurora extracts multimodal domain knowledge as guidance and then uses Modality-Guided Multi-head Self-Attention to inject it into the modeling of temporal representations. In the decoding phase, the multimodal representations are used to generate the conditions and prototypes of future tokens, which feed a novel Prototype-Guided Flow Matching for generative probabilistic forecasting. Comprehensive experiments on well-recognized benchmarks, including TimeMMD, TSFM-Bench, and ProbTS, demonstrate the consistent state-of-the-art performance of Aurora in both unimodal and multimodal scenarios.
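
To make the decoding idea more concrete, the following is a minimal sketch of generic flow matching with a straight-line probability path, not Aurora's actual implementation. It assumes that "prototype-guided" means the generative process starts from a noisy copy of a prototype of the future tokens rather than from pure noise; that reading, and every name used here (`interpolate`, `sample_start`, `euler_sample`, `noise_scale`), is hypothetical and does not come from the paper. An oracle velocity stands in for the trained network.

```python
import random


def interpolate(x0, x1, t):
    """Return the point at time t on the straight path from x0 to x1,
    together with the constant velocity (x1 - x0) of that path.
    In flow matching training, a network is regressed onto this velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def sample_start(prototype, noise_scale, rng):
    """Assumption (not from the paper): the starting sample is a prototype
    perturbed by Gaussian noise instead of a pure standard-normal draw."""
    return [p + noise_scale * rng.gauss(0.0, 1.0) for p in prototype]


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    prototype = [0.0, 1.0, 2.0]   # hypothetical prototype of future tokens
    future = [0.5, 1.5, 2.5]      # hypothetical ground-truth future tokens
    x0 = sample_start(prototype, noise_scale=0.1, rng=rng)

    # Oracle velocity for this (x0, future) pair; a trained, condition-aware
    # network would replace this in a real model.
    _, target_velocity = interpolate(x0, future, t=0.0)
    forecast = euler_sample(lambda x, t: target_velocity, x0, steps=8)
    print(forecast)
```

With a learned velocity network in place of the oracle, drawing several starting points with `sample_start` and integrating each one would yield a set of forecast samples, which is what makes this style of decoder probabilistic.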
