ViTime: Foundation Model for Time Series Forecasting Powered by Vision Intelligence

Luoxiao Yang, Yun Wang, Xinqi Fan, Israel Cohen, Jingdong Chen, Zijun Zhang

TL;DR

ViTime introduces a vision-based TSF foundation model that operates in a binary image space, transforming numerical time series via a mapping f:S→V and leveraging Earth Mover’s Distance–style metrics to quantify similarity. A key innovation is RealTS, a synthetic data generator that emphasizes fundamental trend and periodic components to enable robust cross-domain generalization. The framework provides rigorous quantization-error bounds, optimal-MS guidance, and SNR advantages for visual representations, along with a ViTime architecture consisting of a Visual Time Tokenizer, Decoder, and Refining Module. With zero-shot, few-shot fine-tuning, and robustness experiments across seven public datasets, ViTime achieves state-of-the-art performance in point and probabilistic forecasting, demonstrating strong scale-robust generalization and resilience to missing data and perturbations. The work also outlines practical limitations and future directions, including adaptive resolutions and richer synthetic data, underscoring the potential of vision-informed approaches for universal TSF tasks.
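The mapping f:S→V and the Earth Mover's Distance-style comparison described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `to_binary_image` and `column_emd`, the image height `h`, the clipping range `ms`, and the one-active-pixel-per-time-step encoding are all assumptions made for illustration.

```python
import numpy as np

def to_binary_image(series, h=64, ms=3.0):
    """Map a 1-D series to an (h, L) binary image with one active pixel per time step.

    Assumption: values are already standardized; they are clipped to [-ms, ms]
    and bucketed into h equal-width rows.
    """
    s = np.clip(np.asarray(series, dtype=float), -ms, ms)
    rows = np.floor((s + ms) / (2 * ms) * h).astype(int)
    rows = np.minimum(rows, h - 1)  # s == ms would otherwise land on row h
    image = np.zeros((h, s.size), dtype=np.uint8)
    image[rows, np.arange(s.size)] = 1
    return image

def column_emd(img_a, img_b):
    """Sum of per-column 1-D Earth Mover's Distances, measured in pixel rows.

    For two unit-mass columns, the 1-D EMD equals the L1 distance between
    their cumulative sums along the value axis.
    """
    cdf_a = np.cumsum(img_a, axis=0, dtype=float)
    cdf_b = np.cumsum(img_b, axis=0, dtype=float)
    return float(np.abs(cdf_a - cdf_b).sum())

a = to_binary_image([0.0, 0.0], h=4, ms=2.0)
b = to_binary_image([1.5, -2.0], h=4, ms=2.0)
print(column_emd(a, b))
```

Unlike a pixel-wise loss, a distance of this kind grows with how far the predicted pixel sits from the true one, which is one reason an EMD-style metric suits a binary image representation.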

Abstract

Time series forecasting (TSF) has great practical value in many fields, including power and energy and transportation. TSF methods range from classical statistics to modern deep learning, yet all of them rest on one fundamental concept: numerical data fitting. As a result, these models have long been known to be problem-specific and to generalize poorly across applications. Practitioners expect a TSF foundation model that serves TSF tasks in different applications. The central question is then how to develop such a TSF foundation model. This paper offers a pioneering study of TSF foundation model development and proposes, for the first time, a vision intelligence-powered framework, ViTime. ViTime fundamentally shifts TSF from numerical fitting to operations in a binary image-based time series metric space and naturally supports both point and probabilistic forecasting. We also provide rigorous theoretical analyses of ViTime, including quantization-induced system error bounds and principled strategies for optimal parameter selection. Furthermore, we propose RealTS, a synthesis algorithm that generates diverse and realistic training samples, enriching the training data and significantly enhancing model generalizability. Extensive experiments demonstrate ViTime's state-of-the-art performance. In zero-shot scenarios, ViTime outperforms TimesFM by 9-15%. With just 10% fine-tuning data, ViTime surpasses both leading foundation models and fully supervised benchmarks, a gap that widens with 100% fine-tuning. ViTime also exhibits exceptional robustness, effectively handling missing data and outperforming TimesFM by 20-30% under various data perturbations, validating the power of its visual-space data operation paradigm.
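To give a concrete sense of what synthesizing training samples from trend and periodic components might look like, here is a minimal sketch. It is not the RealTS algorithm, whose details are not given in this summary: the function name `synth_series`, the linear trend, the number of sinusoids, and all parameter ranges are hypothetical choices.

```python
import numpy as np

def synth_series(length=512, rng=None):
    """Hypothetical generator: linear trend + a few sinusoids + Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(length, dtype=float)

    # Trend component: a random slope over the normalized time axis.
    trend = rng.uniform(-1.0, 1.0) * t / length

    # Periodic component: sum of 1 to 3 sinusoids with random period, amplitude, phase.
    periodic = np.zeros(length)
    for _ in range(rng.integers(1, 4)):
        period = rng.uniform(8, length / 2)
        amp = rng.uniform(0.2, 1.0)
        phase = rng.uniform(0, 2 * np.pi)
        periodic += amp * np.sin(2 * np.pi * t / period + phase)

    noise = rng.normal(0.0, 0.05, size=length)
    series = trend + periodic + noise

    # Standardize so every sample shares a common scale.
    return (series - series.mean()) / (series.std() + 1e-8)

sample = synth_series(256, np.random.default_rng(0))
print(sample.shape)
```

Passing a seeded `numpy.random.Generator` makes each sample reproducible, which helps when comparing models trained on the same synthetic corpus.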

Paper Structure

This paper contains 70 sections, 12 theorems, 90 equations, 36 figures, 23 tables.

Key Result

Theorem 3.3

Given a tensor $\widehat{s} \in \mathcal{S} \subset \mathbb{R}^{c \times L}$, the system error, defined as $\left\| f^{-1}\left( f\left( \widehat{s} \right) \right) - \widehat{s} \right\|_{1}$, satisfies an upper bound expressed in terms of $\Phi$, the cumulative distribution function of $\mathcal{N}(\mathbf{0}, \mathbf{I})$.
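The system error $\| f^{-1}(f(\widehat{s})) - \widehat{s} \|_{1}$ can be measured empirically under an assumed mapping. The sketch below does not reproduce the theorem's bound. It only separates two plausible error sources: in-range quantization, and clipping of values beyond a range `ms`. The mapping (clip, bucket into `h` rows, decode at bin centers) and the names `quantize_roundtrip` and `system_error` are assumptions, as is the standard-normal input.

```python
import numpy as np

def quantize_roundtrip(s, h=128, ms=3.5):
    """Assumed mapping: clip to [-ms, ms], bucket into h rows, decode at bin centers."""
    s = np.asarray(s, dtype=float)
    width = 2 * ms / h  # height of one pixel row in value units
    rows = np.floor((np.clip(s, -ms, ms) + ms) / width).astype(int)
    rows = np.minimum(rows, h - 1)  # keep s == ms inside the top row
    return -ms + (rows + 0.5) * width

def system_error(s, h=128, ms=3.5):
    """L1 round-trip error, split into in-range and clipped contributions."""
    s = np.asarray(s, dtype=float)
    err = np.abs(quantize_roundtrip(s, h, ms) - s)
    clipped = np.abs(s) > ms
    return float(err[~clipped].sum()), float(err[clipped].sum())

rng = np.random.default_rng(0)
s = rng.standard_normal((3, 512))  # c = 3 channels, L = 512 time steps
for h in (16, 64, 256):
    in_range, clipped = system_error(s, h=h, ms=3.5)
    print(h, in_range, clipped)
```

In this sketch each in-range element contributes at most `ms / h` (half a pixel row), so the in-range term shrinks as `h` grows, while the clipped term depends only on `ms` and the tails of the data. That trade-off is the kind of consideration behind choosing `ms` for a fixed resolution.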

Figures (36)

  • Figure 1: ViTime architecture overview. (a) Pipeline comparison between ViTime and traditional numerical TSF models, showing ViTime's paradigm shift to binary image space processing. (b) ViTime network with three modules: Visual Time Tokenizer, Decoder, and Refining Module. (c) Complete architecture: RealTS synthesis for diverse training samples, mapping function for numerical-to-binary conversion, ViTime model for visual pattern learning, and inverse mapping for prediction output, enabling zero-shot generalization across real-world time series tasks.
  • Figure 2: Radar plots comparing the average MAE of ViTime and TimesFM across different rescale factors. The radial axis represents MAE, with lower values (larger radius) indicating better performance. Each axis corresponds to a specific rescale factor.
  • Figure 3: Performance with different proportions of fine-tuning data.
  • Figure 4: Performance comparison of ViTime versus TimesFM on TSF tasks under various data perturbations: (a) original time series; (b) time series with injected noise; (c) time series with an added harmonic; (d) time series with missing data.
  • Figure 5: Robustness analysis under increasing Gaussian noise levels.
  • ...and 31 more figures

Theorems & Definitions (19)

  • Definition 3.1: Binary image-based time series metric space
  • Theorem 3.3: System Error Upper Bound
  • Proposition 3.4: Asymptotic Convergence with $h$
  • Proposition 3.5: Optimal MS Selection
  • Proposition 3.6: Optimal Threshold under Variance Scaling
  • Theorem 3.7: Stripe SNR Boost
  • Theorem 3.8: Gaussian Blur SNR Boost
  • Theorem C.1: Theorem 3.3 (System Error Upper Bound) restated
  • proof
  • Proposition C.2: Proposition 3.4 restated
  • ...and 9 more