VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Vision Backbones

Lefei Shen, Mouxiang Chen, Xu Liu, Han Fu, Xiaoxue Ren, Jianling Sun, Zhuo Li, Chenghao Liu

TL;DR

This work adapts vision foundation models to time-series forecasting through continual pre-training on large-scale time-series data. It introduces three innovations (vision-model-based filtering, colorized multivariate conversion, and multi-quantile forecasting) to bridge the modality, multivariate, and probabilistic gaps that hinder cross-modal transfer. Empirical results show state-of-the-art performance on in-distribution (Monash) and out-of-distribution (LTSF, PF, GIFT-Eval) benchmarks, including MSE reductions of 6%-44% over specialized time-series foundation models and first place on GIFT-Eval. The findings suggest that vision priors, properly adapted, can yield universal time-series foundation models.

Abstract

Recent studies have indicated that vision models pre-trained on images can serve as time series foundation models (TSFMs) by reformulating time series forecasting (TSF) as image reconstruction. However, effective cross-modal transfer from vision to time series remains challenging due to three discrepancies: (1) the data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) the multivariate-forecasting gap between fixed RGB-three-channel vision models and time series with arbitrary numbers of variates; and (3) the probabilistic-forecasting gap between the deterministic outputs of vision models and the requirement for uncertainty-aware probabilistic predictions. To bridge these gaps, we propose VisionTS++, a TSFM based on continual pre-training of a vision model on large-scale time series. Our approach introduces three key innovations: (1) vision-model-based filtering, which identifies high-quality sequences to stabilize pre-training and mitigate the modality gap; (2) colorized multivariate conversion, which encodes multivariate series as multi-subfigure RGB images to enhance cross-variate modeling; and (3) multi-quantile forecasting, which uses parallel reconstruction heads to generate quantile forecasts without parametric assumptions. Experiments show that VisionTS++ achieves state-of-the-art performance in both in-distribution and out-of-distribution forecasting, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first on the GIFT-Eval benchmark, which comprises 23 datasets across 7 domains. Our work demonstrates that, with appropriate adaptation, vision models can effectively generalize to TSF, thus advancing the pursuit of universal TSFMs. Code is available at https://github.com/HALF111/VisionTSpp.
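
The abstract does not state which objective trains the parallel quantile heads. A common choice for quantile forecasts without parametric assumptions is the pinball (quantile) loss, so the sketch below assumes that loss. The function names, the quantile levels 0.1/0.5/0.9, and the toy numbers are all hypothetical and are not taken from the paper's code.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss of forecast y_pred at quantile level q."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction (diff >= 0) is weighted by q, over-prediction by 1 - q.
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


def multi_quantile_loss(y_true, preds_by_quantile):
    """Average the pinball loss over several heads, one per quantile level.

    preds_by_quantile maps a quantile level to that head's forecast
    (an assumed interface, chosen only for this illustration).
    """
    losses = [pinball_loss(y_true, pred, q) for q, pred in preds_by_quantile.items()]
    return sum(losses) / len(losses)


# Hypothetical example: three heads forecasting the same three-step target.
target = [1.0, 2.0, 3.0]
heads = {
    0.1: [0.5, 1.5, 2.5],
    0.5: [1.0, 2.0, 3.0],
    0.9: [1.5, 2.5, 3.5],
}
loss = multi_quantile_loss(target, heads)
```

Because each head is penalized asymmetrically according to its own level, the heads can settle on different quantiles of the predictive distribution without assuming a parametric family such as a Gaussian.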

Paper Structure

This paper contains 38 sections, 1 equation, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Left: Training pipeline of VisionTS++. We perform continual pre-training of a visual backbone (MAE) on large-scale time series datasets to create a powerful and universal TSFM. Right: VisionTS++ outperforms Moirai and VisionTS in both multivariate and probabilistic forecasting, demonstrating its superior effectiveness.
  • Figure 2: Overview of VisionTS++. For each input, the following pipeline is applied: (1) Samples with out-of-range values after normalization are filtered out; (2) Each variate is segmented by periodicity and rendered as a colored subfigure, forming a composite image; (3) Multiple quantile forecasts are generated via parallel reconstruction heads. The model conducts continual pre-training on such transformed time series data to adapt MAE for universal forecasting.
  • Figure 3: Normalized MAE results on Monash Benchmark, with full results in Table \ref{tab:monash_full} (Appendix \ref{subsec:app_monash_full}). Model sizes are denoted as: s (small), b (base), l (large).
  • Figure 4: Ranks on GIFT-Eval Benchmark (cut-off at 2025/08).
  • Figure 5: Forecasting visualization on a sample from ETTm1. (a-b) Input/Output images of VisionTS++. (c-d) Prediction comparison between VisionTS++ and VisionTS.
  • ...and 1 more figure
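
The first two steps in the Figure 2 caption (filtering samples with out-of-range normalized values, then segmenting each variate by periodicity into a colored subfigure) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: z-score normalization, the bound of 3.0, the row-wise folding, and the cycling R/G/B channel assignment are all hypothetical choices, as are the function names.

```python
import statistics


def normalize(series):
    """Standardize a series to zero mean and unit variance (assumed scheme)."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0.0:
        return [0.0 for _ in series]
    return [(x - mean) / std for x in series]


def in_range(normalized, bound):
    """True if every normalized value lies within [-bound, bound]."""
    return all(-bound <= x <= bound for x in normalized)


def segment_by_period(series, period):
    """Fold a 1D series into rows of length `period`, dropping a trailing partial row."""
    n_rows = len(series) // period
    return [series[r * period:(r + 1) * period] for r in range(n_rows)]


def to_composite(variates, period, bound=3.0):
    """Return one (channel, grid) subfigure per variate, or None if the sample is filtered.

    Channels cycle through R, G, B so that adjacent variates get different colors.
    """
    subfigures = []
    for i, series in enumerate(variates):
        z = normalize(series)
        if not in_range(z, bound):
            return None  # discard the whole sample
        subfigures.append(("RGB"[i % 3], segment_by_period(z, period)))
    return subfigures


# Hypothetical data: a clean periodic variate and one with an extreme spike.
clean = [1.0, 2.0, 3.0, 4.0] * 2
spike = [0.0] * 19 + [100.0]
kept = to_composite([clean, clean], period=4)
dropped = to_composite([spike], period=4)
```

Here `kept` holds two subfigures, each a 2 x 4 grid, while `dropped` is `None` because the spike normalizes to a value beyond the assumed bound.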