VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

Mouxiang Chen, Lefei Shen, Zhuo Li, Xiaoyun Joy Wang, Jianling Sun, Chenghao Liu

TL;DR

VisionTS introduces a cross-modal time-series forecasting approach by repurposing a visual MAE pretrained on ImageNet to perform zero-shot forecasts. It maps look-back windows to visible image patches and forecast horizons to masked patches, enabling image-based reconstruction to predict future values. On eight long-term benchmarks and large zero-shot GIFT-Eval/Monash suites, VisionTS achieves competitive or state-of-the-art zero-shot performance and shows strong gains with minimal fine-tuning. Analyses reveal that some time series align closely with ImageNet distributions, supporting cross-modality transfer, while larger image backbones may overfit to image-specific cues. Limitations include handling multivariate interactions and distribution forecasting, suggesting diffusion-based or more advanced cross-modal architectures as future work.

Abstract

Foundation models have emerged as a promising approach in time series forecasting (TSF). Existing approaches either repurpose large language models (LLMs) or build large-scale time series datasets to develop TSF foundation models for universal forecasting. However, these methods face challenges due to the severe cross-domain gap or in-domain heterogeneity. This paper explores a new road to building a TSF foundation model from rich, high-quality natural images. Our key insight is that a visual masked autoencoder, pre-trained on the ImageNet dataset, can naturally be a numeric series forecaster. By reformulating TSF as an image reconstruction task, we bridge the gap between image pre-training and TSF downstream tasks. Surprisingly, without further adaptation in the time series domain, the proposed VisionTS could achieve better zero-shot forecast performance than existing TSF foundation models. With fine-tuning for one epoch, VisionTS could further improve the forecasting and achieve state-of-the-art performance in most cases. Extensive experiments reveal intrinsic similarities between images and real-world time series, suggesting that visual models may offer a "free lunch" for TSF and highlight the potential for future cross-modality research. Our code is publicly available at https://github.com/Keytoyze/VisionTS.
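
The reformulation of forecasting as image reconstruction can be pictured at the patch level: the look-back window occupies the left (visible) part of the image and the forecast horizon occupies the right (masked) part. The sketch below is illustrative only and is not taken from the VisionTS code. The function name `patch_mask`, the 14 x 14 patch grid (what a 224 x 224 image with 16 x 16 patches would give), and the rule that splits the width in proportion to the two lengths are all assumptions made for this example.

```python
import numpy as np


def patch_mask(lookback: int, horizon: int, grid: int = 14) -> np.ndarray:
    """Return a (grid, grid) boolean array; True marks masked patches.

    Hypothetical helper: the visible/masked split is assumed to be
    proportional to the look-back and horizon lengths.
    """
    # Number of patch columns given to the look-back window (floor division).
    visible_cols = grid * lookback // (lookback + horizon)
    # Keep at least one visible and one masked column.
    visible_cols = min(max(visible_cols, 1), grid - 1)

    mask = np.zeros((grid, grid), dtype=bool)
    mask[:, visible_cols:] = True  # right-hand columns are to be reconstructed
    return mask


# Example: 288 observed steps, 96 steps to forecast.
mask = patch_mask(lookback=288, horizon=96)
print(mask.sum(), "of", mask.size, "patches are masked")
```

A masked autoencoder would then be asked to reconstruct only the patches where `mask` is True, which is what lets reconstruction play the role of forecasting.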

Paper Structure (60 sections, 1 equation, 14 figures, 22 tables)

Figures (14)

  • Figure 1: Long-term forecasting (left) and GIFT-Eval (right) performance comparison. Our VisionTS, without any training on time series data, outperforms the pure time series foundation models in the zero-shot setting.
  • Figure 2: An image from the ImageNet dataset, in which the pixel arrays display many well-known features of real-world time series, such as trend, seasonality, and stationarity (qiu2024tfb). After self-supervised pre-training on ImageNet, it is reasonable to expect a visual model to understand these features and exhibit some time series forecasting ability.
  • Figure 3: VisionTS architecture. The input is first segmented by period, rendered into a grayscale image, and then aligned with the visible patches on the left through resampling. MAE predicts the masked patches on the right, and the reconstructed image is then converted back into the forecast.
  • Figure 4: Performance on the GIFT-Eval Leaderboard (cut-off at VisionTS's release).
  • Figure 5: Aggregated results on the Monash TSF Benchmark, with full results in \ref{tab:zero_shot_monash} (\ref{sec:app_zs_monash}).
  • ...and 9 more figures
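
The pipeline in the Figure 3 caption (segment by period, render as a 2-D array, reconstruct the right-hand side, convert back to a series) can be sketched as a round trip in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation. All function names are hypothetical, the per-window standardization is an assumption, the resampling to the image resolution is omitted, and `naive_reconstruct` is a deliberately simple stand-in for the pretrained MAE: it only repeats the last visible column.

```python
import numpy as np


def series_to_image(x: np.ndarray, period: int) -> tuple[np.ndarray, float, float]:
    """Fold a 1-D window into a (period, n_segments) array, one period per column."""
    n_seg = len(x) // period
    x = x[-n_seg * period:]                # keep the most recent whole periods
    mu, sigma = x.mean(), x.std() + 1e-8   # assumed per-window standardization
    img = ((x - mu) / sigma).reshape(n_seg, period).T
    return img, mu, sigma


def naive_reconstruct(img: np.ndarray, n_future_cols: int) -> np.ndarray:
    """Stand-in for the MAE (NOT the real model): repeat the last visible column."""
    return np.repeat(img[:, -1:], n_future_cols, axis=1)


def image_to_forecast(cols: np.ndarray, mu: float, sigma: float, horizon: int) -> np.ndarray:
    """Unfold predicted columns back into a 1-D series and undo the scaling."""
    return (cols.T.reshape(-1) * sigma + mu)[:horizon]


def forecast(x: np.ndarray, period: int, horizon: int) -> np.ndarray:
    img, mu, sigma = series_to_image(x, period)
    n_future_cols = -(-horizon // period)  # ceiling division
    future = naive_reconstruct(img, n_future_cols)
    return image_to_forecast(future, mu, sigma, horizon)


# Example: a sine wave with period 24, forecast 36 steps ahead.
t = np.arange(96)
x = np.sin(2 * np.pi * t / 24)
y_hat = forecast(x, period=24, horizon=36)
print(y_hat.shape)
```

Folding by period puts the same phase of each cycle on the same image row, so seasonality shows up as horizontal structure that an image model can extend to the right. Replacing `naive_reconstruct` with a masked-autoencoder call is the step this sketch leaves out.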