Multi-Modal View Enhanced Large Vision Models for Long-Term Time Series Forecasting

ChengAo Shen, Wenchao Yu, Ziming Zhao, Dongjin Song, Wei Cheng, Haifeng Chen, Jingchao Ni

TL;DR

This work tackles long-term time series forecasting (LTSF) with multi-modal views (MMVs) of the same signal, converting series into images so that pre-trained large vision models (LVMs) can serve as forecasters. It identifies an inductive bias of the SOTA LVM forecaster toward periodic patterns and introduces DMMV, a decomposition-based MMV framework with two variants (DMMV-s and DMMV-a) that fuse a visual forecaster with a numerical forecaster via gating. DMMV-a uses a novel BackCast-Mask adaptive decomposition to separate seasonal and trend components. Across eight benchmarks, DMMV outperforms 14 baselines, including the strong VisionTS baseline on data that is not fully periodic, and achieves the best MSE on 6 of the 8 datasets. The results show the value of combining MMVs with adaptive decomposition for long-horizon forecasting and point to future work on efficiency and imaging techniques for practical deployment.

Abstract

Time series, typically represented as numerical sequences, can also be transformed into images and texts, offering multi-modal views (MMVs) of the same underlying signal. These MMVs can reveal complementary patterns and enable the use of powerful pre-trained large models, such as large vision models (LVMs), for long-term time series forecasting (LTSF). However, as we identify in this work, the state-of-the-art (SOTA) LVM-based forecaster exhibits an inductive bias towards "forecasting periods". To harness this bias, we propose DMMV, a novel decomposition-based multi-modal view framework that leverages trend-seasonal decomposition and a novel backcast-residual-based adaptive decomposition to integrate MMVs for LTSF. Comparative evaluations against 14 SOTA models across diverse datasets show that DMMV outperforms single-view and existing multi-modal baselines, achieving the best mean squared error (MSE) on 6 out of 8 benchmark datasets. The code for this paper is available at: https://github.com/D2I-Group/dmmv.
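To make the trend-seasonal decomposition and gated fusion mentioned above concrete, here is a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). The function names, the kernel size of 25, the edge padding, and the convex-combination form of the gate are all assumptions made for illustration.

```python
import numpy as np


def moving_average_decompose(x, kernel=25):
    """Split a 1-D series into a smooth trend and a seasonal remainder.

    Hypothetical helper: the kernel size and edge padding are assumptions.
    """
    x = np.asarray(x, dtype=float)
    left = kernel // 2
    right = kernel - 1 - left
    # Repeat the end points so the trend keeps the same length as x.
    padded = np.pad(x, (left, right), mode="edge")
    trend = np.convolve(padded, np.ones(kernel) / kernel, mode="valid")
    seasonal = x - trend
    return trend, seasonal


def gated_fusion(visual_pred, numerical_pred, gate):
    """Blend two forecasts with a gate in [0, 1].

    Assumed form: a convex combination. In a trained model the gate
    would be learned; here it is just a number or an array.
    """
    g = np.clip(np.asarray(gate, dtype=float), 0.0, 1.0)
    return g * np.asarray(visual_pred) + (1.0 - g) * np.asarray(numerical_pred)
```

In this sketch the two components always sum back to the input, so a forecaster for each part can be trained separately and the outputs recombined. A gate of 1 keeps only the visual forecast and a gate of 0 keeps only the numerical one.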

Paper Structure

This paper contains 29 sections, 5 equations, 17 figures, 8 tables, 1 algorithm.

Figures (17)

  • Figure 1: An overview of the DMMV framework. (a) DMMV-s uses a moving average to extract trend and seasonal components. (b) DMMV-a uses a backcast-residual decomposition to automatically learn trend and seasonal components. In (b), the gray blocks are gray-scale images. "?" marks masks.
  • Figure 2: An illustration of an LVM forecaster.
  • Figure 3: An illustration of the LVM forecaster's inductive bias. The time series has a period of 24. The vertical dashed lines mark the segment points. The example indicates a bias towards segment lengths that are multiples of the period, as in (a) and (d), over other segment lengths, as in (b) and (c).
  • Figure 4: An illustration of BCmask.
  • Figure 5: Critical difference (CD) diagram of the average rank of all 16 compared methods in terms of (a) MSE and (b) MAE over all benchmark datasets. A lower rank (left of the scale) is better.
  • ...and 12 more figures
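
The segmentation described in the Figure 3 caption can be sketched as follows. This is a hedged illustration of one common way to turn a series into a 2-D gray-scale array by cutting it into fixed-length segments and stacking them as columns; the function name `series_to_image`, the min-max scaling, and the choice to drop the oldest leftover points are assumptions, not details taken from the paper.

```python
import numpy as np


def series_to_image(x, segment_length):
    """Stack fixed-length segments of a 1-D series as image columns.

    Hypothetical helper: returns an array of shape
    (segment_length, n_segments) scaled to [0, 1].
    """
    x = np.asarray(x, dtype=float)
    n_segments = len(x) // segment_length
    if n_segments == 0:
        raise ValueError("series is shorter than one segment")
    # Keep the most recent full segments and drop older leftover points.
    trimmed = x[len(x) - n_segments * segment_length:]
    image = trimmed.reshape(n_segments, segment_length).T
    lo, hi = image.min(), image.max()
    if hi > lo:
        image = (image - lo) / (hi - lo)
    else:
        image = np.zeros_like(image)
    return image


# A series with period 24, as in the Figure 3 caption.
t = np.arange(96)
x = np.sin(2 * np.pi * t / 24)
aligned = series_to_image(x, 24)     # segment length equals the period
misaligned = series_to_image(x, 20)  # segment length is not a multiple
```

When the segment length is a multiple of the period, every column of `aligned` holds the same pattern, so the image has a regular horizontal structure. With a segment length of 20 the columns drift in phase, which is the kind of mismatch the caption contrasts.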