From Pixels to Predictions: Spectrogram and Vision Transformer for Better Time Series Forecasting

Zhen Zeng, Rachneet Kaur, Suchetha Siddagangappa, Tucker Balch, Manuela Veloso

TL;DR

This work approaches time series forecasting from a time-frequency perspective. It introduces a spectrogram-based visual representation, augmented with the intensities of the time series, and processes it with a vision transformer to learn joint time-frequency patterns. Across synthetic, temperature, and financial datasets, the proposed ViT-num-spec approach outperforms statistical baselines and other vision-based methods, which highlights the value of multimodal inputs. The framework offers a domain-agnostic way to apply established computer vision models to time series prediction, with practical implications for finance and beyond.

Abstract

Time series forecasting plays a crucial role in decision-making across various domains, but it presents significant challenges. Recent studies have explored image-driven approaches that use computer vision models to address these challenges, often employing lineplots as the visual representation of time series data. In this paper, we propose a novel approach that uses time-frequency spectrograms as the visual representation of time series data. We introduce the use of a vision transformer for multimodal learning and showcase the advantages of our approach across diverse datasets from different domains. To evaluate its effectiveness, we compare our method against statistical baselines (EMA and ARIMA), a state-of-the-art deep learning-based approach (DeepAR), and other visual representations of time series data (lineplot images), and we conduct an ablation study that uses only the time series as input. Our experiments demonstrate the benefits of using spectrograms as a visual representation for time series data, along with the advantages of employing a vision transformer for simultaneous learning in both the time and frequency domains.
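
To make the input representation concrete, the following is a minimal sketch of how a time series might be turned into a spectrogram image with an intensity strip on top. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which time-frequency transform is used, so a Hann-windowed short-time Fourier transform is swapped in here, and the function names, the window length, and the number of intensity rows are all hypothetical choices.

```python
import numpy as np


def stft_magnitude(series, window=16, hop=1):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform.

    Assumption: the STFT stands in for whatever time-frequency transform
    the paper actually uses. Returns an array of shape (frequency, time).
    """
    series = np.asarray(series, dtype=float)
    pad = window // 2
    # Reflect-pad so that every time step gets a full-length frame.
    padded = np.pad(series, (pad, pad), mode="reflect")
    taper = np.hanning(window)
    frames = np.stack([
        padded[start:start + window] * taper
        for start in range(0, len(series), hop)
    ])
    # Real FFT of each frame, then transpose to (frequency, time).
    return np.abs(np.fft.rfft(frames, axis=1)).T


def augmented_spectrogram(series, window=16, intensity_rows=4):
    """Stack a min-max scaled intensity strip on top of the spectrogram.

    `intensity_rows` (hypothetical) controls how many pixel rows the
    intensity strip occupies at the top of the image.
    """
    series = np.asarray(series, dtype=float)
    spec = stft_magnitude(series, window)
    spec = spec / (spec.max() + 1e-12)  # scale magnitudes to [0, 1]
    lo, hi = series.min(), series.max()
    intensity = (series - lo) / (hi - lo + 1e-12)  # scale values to [0, 1]
    strip = np.tile(intensity, (intensity_rows, 1))
    return np.vstack([strip, spec])


# Usage: an 80-step sinusoid with a period of 8 steps.
t = np.arange(80)
series = np.sin(2 * np.pi * t / 8)
image = augmented_spectrogram(series)
print(image.shape)  # (13, 80): 4 intensity rows + 9 frequency bins
```

The resulting single-channel array can be treated as an image and split into patches for a vision transformer; how the patches are formed and how numeric and visual inputs are combined is specific to the paper and is not reproduced here.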

Paper Structure (31 sections, 2 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Visual representation of a time series as a time-frequency spectrogram, augmented with the intensities of the time series at the top.
  • Figure 2: Illustrations of the inputs of the three datasets: a) synthetic, b) temperature, and c) financial stock prices. The top panels show the raw time series represented as lineplots and the bottom panels depict the augmented time-frequency spectrogram. Each input time series consists of 80 steps for the synthetic and financial datasets, while the temperature dataset has 50 steps. For the financial and temperature data, each time step represents a 1-day time interval.
  • Figure 3: Overview of the proposed approach.
  • Figure 4: Qualitative examples of predictions on the three datasets: a) synthetic, b) temperature, and c) financial stock prices.