VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters
Mouxiang Chen, Lefei Shen, Zhuo Li, Xiaoyun Joy Wang, Jianling Sun, Chenghao Liu
TL;DR
VisionTS repurposes a visual masked autoencoder (MAE) pretrained on ImageNet as a zero-shot time series forecaster. It maps the look-back window to visible image patches and the forecast horizon to masked patches, so that reconstructing the masked region of the image yields the predicted future values. On eight long-term forecasting benchmarks and the large GIFT-Eval and Monash zero-shot suites, VisionTS achieves competitive or state-of-the-art zero-shot performance and improves further with minimal fine-tuning. Analyses show that some time series align closely with the ImageNet distribution, which supports cross-modality transfer, while larger image backbones may overfit to image-specific cues. Limitations include the handling of multivariate interactions and distribution forecasting; diffusion-based or more advanced cross-modal architectures are suggested as future work.
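The mapping from a look-back window and forecast horizon to visible and masked image regions can be illustrated with a small sketch. This is not the authors' implementation. It assumes the series has a known period, that both lengths are multiples of that period, and that each period is laid out as one image column. The helper names (series_to_masked_image, naive_reconstruct, image_to_forecast) are hypothetical, and naive_reconstruct is only a stand-in for the pretrained MAE, which the sketch does not load.

```python
import numpy as np


def series_to_masked_image(lookback, horizon, period):
    """Fold a 1D look-back window into a 2D grid and append masked columns.

    Hypothetical helper for illustration; the layout (one period per
    column) is an assumption, not taken from the summary above.
    """
    lookback = np.asarray(lookback, dtype=float)
    if len(lookback) % period != 0 or horizon % period != 0:
        raise ValueError("this sketch assumes lengths are multiples of the period")
    # Time runs top-to-bottom within a column, then left-to-right across columns.
    visible = lookback.reshape(-1, period).T           # (period, L // period)
    future_cols = horizon // period
    image = np.concatenate([visible, np.zeros((period, future_cols))], axis=1)
    mask = np.zeros(image.shape, dtype=bool)
    mask[:, visible.shape[1]:] = True                  # True = masked region
    return image, mask


def naive_reconstruct(image, mask):
    """Stand-in for the MAE: copy the last visible column into masked columns."""
    out = image.copy()
    last_visible = image[:, ~mask[0]][:, -1]
    out[:, mask[0]] = last_visible[:, None]
    return out


def image_to_forecast(reconstructed, mask):
    """Read the reconstructed masked columns back as a 1D forecast."""
    future = reconstructed[:, mask[0]]
    return future.T.reshape(-1)


# Usage: two visible periods of length 4, one period to forecast.
image, mask = series_to_masked_image(np.arange(8), horizon=4, period=4)
forecast = image_to_forecast(naive_reconstruct(image, mask), mask)
```

In this layout the forecast horizon always occupies the rightmost columns, so a model that fills in masked image regions produces a forecast without any change to its reconstruction objective.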
Abstract
Foundation models have emerged as a promising approach in time series forecasting (TSF). Existing approaches either repurpose large language models (LLMs) or build large-scale time series datasets to develop TSF foundation models for universal forecasting. However, these methods face challenges due to the severe cross-domain gap or in-domain heterogeneity. This paper explores a new road to building a TSF foundation model from rich, high-quality natural images. Our key insight is that a visual masked autoencoder, pre-trained on the ImageNet dataset, can naturally be a numeric series forecaster. By reformulating TSF as an image reconstruction task, we bridge the gap between image pre-training and TSF downstream tasks. Surprisingly, without further adaptation in the time series domain, the proposed VisionTS achieves better zero-shot forecast performance than existing TSF foundation models. With fine-tuning for one epoch, VisionTS further improves its forecasts and achieves state-of-the-art performance in most cases. Extensive experiments reveal intrinsic similarities between images and real-world time series, suggesting that visual models may offer a "free lunch" for TSF and highlighting the potential for future cross-modality research. Our code is publicly available at https://github.com/Keytoyze/VisionTS.
