Minimal Time Series Transformer

Joni-Kristian Kämäräinen

TL;DR

This work investigates how to adapt the vanilla Transformer for continuous-valued time series forecasting with minimal changes. By replacing the token embedding with a linear projection, the MiTS-Transformer provides a simple baseline, and the PoTS-Transformer introduces positional-encoding expansion to handle long sequences with a compact model. Across sinusoid-based Type 1–Type 3 data, MiTS demonstrates strong learning on Type 1–2, while PoTS-Transformer often outperforms MiTS on the most challenging Type 3, highlighting the trade-off between model size and overfitting. The study suggests that simple, well-chosen modifications can yield effective transformer-based time series forecasting without resorting to complex architectures.

Abstract

The Transformer is the state-of-the-art model for many natural language processing, computer vision, and audio analysis problems. It combines information from past input and output samples in an auto-regressive manner, so that each sample becomes aware of all inputs and outputs. In sequence-to-sequence (Seq2Seq) modeling, the transformer-processed samples become effective in predicting the next output. Time series forecasting is a Seq2Seq problem. The original architecture is defined for discrete input and output sequence tokens, so to adopt it for time series the model must be adapted for continuous data. This work introduces minimal adaptations that make the original Transformer architecture suitable for continuous-valued time series data.
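The adaptation described above (a linear projection in place of the token embedding) can be sketched in PyTorch roughly as follows. This is an illustrative sketch, not the author's implementation: the class name `MinimalSeriesTransformer`, the helper `sinusoidal_positions`, the single attention head, the single encoder and decoder layer, the shared input projection for source and target, and the linear read-out are all assumptions. Only `d_model=8` and `dim_feedforward=8` are taken from the figure captions.

```python
import math

import torch
from torch import nn


def sinusoidal_positions(length: int, d_model: int) -> torch.Tensor:
    """Fixed sine/cosine positional encoding of shape (length, d_model)."""
    position = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


class MinimalSeriesTransformer(nn.Module):
    """Hypothetical sketch: vanilla nn.Transformer with linear input/output layers."""

    def __init__(self, d_model: int = 8, dim_feedforward: int = 8):
        super().__init__()
        self.d_model = d_model
        # A linear projection takes the place of the token-embedding lookup.
        self.input_proj = nn.Linear(1, d_model)
        # Layer and head counts are assumptions chosen to keep the model small.
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=1,
            num_encoder_layers=1,
            num_decoder_layers=1,
            dim_feedforward=dim_feedforward,
            dropout=0.0,
            batch_first=True,
        )
        # A linear read-out maps each decoder state back to one real value.
        self.output_proj = nn.Linear(d_model, 1)

    def embed(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, length, 1); the encoding broadcasts over the batch.
        positions = sinusoidal_positions(x.size(1), self.d_model).to(x.device)
        return self.input_proj(x) + positions

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        # Causal mask: each target position attends only to itself and earlier ones.
        n = tgt.size(1)
        tgt_mask = torch.triu(
            torch.full((n, n), float("-inf"), device=tgt.device), diagonal=1
        )
        hidden = self.transformer(self.embed(src), self.embed(tgt), tgt_mask=tgt_mask)
        return self.output_proj(hidden)


model = MinimalSeriesTransformer()
num_params = sum(p.numel() for p in model.parameters())
forecast = model(torch.zeros(2, 19, 1), torch.zeros(2, 12, 1))
```

Under these assumed settings the parameter count comes to 1,289, the same number quoted in the figure captions, although that agreement alone does not confirm the author's exact configuration.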

Paper Structure

This paper contains 17 sections, 3 equations, 8 figures.

Figures (8)

  • Figure 1: Example sinusoids used in the experiments.
  • Figure 2: Sinusoids (Type 2 sequences) of 31 samples, each divided into a source part $\mathbf{X}$ (samples 0-18, in blue) and a target part $\mathbf{Y}$ (samples 19-30, in green).
  • Figure 3: Single sequence sanity check (Type 1 data) of the MiTS-Transformer implementation (see Jupyter notebook).
  • Figure 4: Four-sequence (Type 2) results of MiTS-Transformer (model parameters: d_model=8, dim_feedforward=8, a total of 1,289 learnable parameters). The signal frequencies were 0/31, 1/31, 2/31, and 3/31.
  • Figure 5: Arbitrary-sequence (Type 3) results of MiTS-Transformer (d_model=8, dim_feedforward=8, 1,289 learnable parameters). The data consists of arbitrary sinusoid sequences with frequencies in the range (0/31, 3/31).
  • ...and 3 more figures
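
The data layout described in the captions (31-sample sinusoids, samples 0-18 as source and 19-30 as target, frequencies k/31) can be reproduced with a short script. This is a minimal sketch under assumptions: the waveform sin(2πfn) with unit amplitude and zero phase, and the helper names `make_sinusoid`, `split_source_target`, and `random_type3_sequence`, are illustrative and not taken from the paper.

```python
import math
import random

SEQ_LEN = 31  # samples per sequence, as in the Figure 2 caption
SRC_LEN = 19  # samples 0-18 form the source X; samples 19-30 form the target Y


def make_sinusoid(freq, length=SEQ_LEN):
    """Return sin(2*pi*freq*n) for n = 0..length-1 (assumed waveform)."""
    return [math.sin(2.0 * math.pi * freq * n) for n in range(length)]


def split_source_target(seq, src_len=SRC_LEN):
    """Split one sequence into (source, target) parts."""
    return seq[:src_len], seq[src_len:]


def random_type3_sequence(rng):
    """Sinusoid with a frequency drawn uniformly between 0/31 and 3/31."""
    return make_sinusoid(rng.uniform(0.0, 3.0) / SEQ_LEN)


# Type 2: four fixed frequencies 0/31, 1/31, 2/31, and 3/31.
type2_pairs = [split_source_target(make_sinusoid(k / SEQ_LEN)) for k in range(4)]

# Type 3: arbitrary frequencies; the seed is fixed only for reproducibility.
rng = random.Random(0)
type3_pairs = [split_source_target(random_type3_sequence(rng)) for _ in range(8)]
```

With a pure sine and zero phase, the 0/31 sequence is identically zero, so the actual experiments may well use a different phase or offset.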