How Can Time Series Analysis Benefit From Multiple Modalities? A Survey and Outlook
Haoxin Liu, Harshavardhan Kamarthi, Zhiyuan Zhao, Shangqing Xu, Shiyu Wang, Qingsong Wen, Tom Hartvigsen, Fei Wang, B. Aditya Prakash
TL;DR
This paper presents the first comprehensive survey of Multiple Modalities for Time Series Analysis (MM4TSA), outlining three core approaches: TimeAsX (reusing foundation models from other modalities), Time+X (multimodal extensions), and Time2X/X2Time (cross-modality interaction). It categorizes work by modality (text, images, audio, tables) and domain (finance, medicine, spatial-temporal), discusses practical datasets and fusion strategies, and identifies key gaps in modality selection, heterogeneous modality integration, and generalization to unseen tasks. The authors propose benchmarks and reasoning-based methods to advance the field and provide an up-to-date GitHub resource with papers and datasets. Overall, the survey highlights the value of leveraging multimodal information to enhance TSA performance, interpretability, and applicability across diverse domains.
Abstract
Time series analysis (TSA) is a longstanding research topic in the data mining community and has wide real-world significance. Compared to "richer" modalities such as language and vision, which have recently experienced explosive development and are densely connected, the time-series modality remains relatively underexplored and isolated. We observe that many recent TSA works have formed a new research field, namely Multiple Modalities for TSA (MM4TSA). In general, these MM4TSA works share a common motivation: how TSA can benefit from multiple modalities. This survey is the first to offer a comprehensive review and a detailed outlook for this emerging field. Specifically, we systematically discuss three benefits: (1) reusing foundation models of other modalities for efficient TSA, (2) multimodal extension for enhanced TSA, and (3) cross-modality interaction for advanced TSA. Within each perspective, we further group the works by the type of modality introduced, including text, images, audio, tables, and others. Finally, we identify gaps and future opportunities corresponding to the three benefits: the selection of modalities to reuse, the combination of heterogeneous modalities, and generalization to unseen tasks. We release an up-to-date GitHub repository that includes key papers and resources.
