Tiny-TSM: Efficiently Training a Lightweight SOTA Time Series Foundation Model

Felix Birkel

TL;DR

This work addresses building effective time series foundation models under tight compute constraints: a 23M-parameter encoder is trained on a single A100 GPU in under a week, using synthetic data from SynthTS. The approach combines DART-Norm, a normalization that enables dense next-token training; a patched transformer architecture with a 2:1 ratio of time-attention to feature-attention blocks; and inference-time strategies (SIFI, symmetry ensembling, multivariate augmentation). Together these reach state-of-the-art performance on a wide range of benchmarks without neural architecture search or hyperparameter tuning. Key contributions include DART-Norm, SynthTS, SIFI, multivariate feature augmentation, and a coarse-grid loss that balances short- and long-horizon forecasting, all evaluated on the GIFT-Eval-NF suite. In practice, this yields resource-efficient time series foundation modeling that rivals larger models trained on curated data and can be deployed on modest hardware, while maintaining high predictive accuracy on medium- and long-horizon tasks.

Abstract

We present Tiny-TSM, a time series foundation model characterized by small scale, economical training, and state-of-the-art performance. It comprises 23M total parameters and is trained on a single A100 GPU in less than a week using a new synthetic data generation and data augmentation pipeline (SynthTS). Without any neural architecture search, hyperparameter tuning, or scaling up of model size, Tiny-TSM achieves state-of-the-art performance on a wide range of time series benchmark datasets, often outperforming much larger models and even matching industrial-scale, likely highly tuned foundation models. Specifically, Tiny-TSM outperforms all other time series foundation models we evaluated on medium- and long-term forecasting tasks under MSE loss, while its short-term accuracy remains competitive with state-of-the-art models. We also introduce a causal input normalization scheme that enables time series models to be trained with a dense next-token prediction loss, significantly accelerating convergence and reducing training time. All experiments were conducted on a single A100 GPU, illustrating the practicality of the proposed approach in a resource-constrained setting.
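To make the abstract's point about causal input normalization concrete, here is a minimal sketch. It is not DART-Norm, whose definition is not given in this section. It is a generic expanding-window normalization, chosen as an assumption purely to illustrate causality: each value is scaled using statistics of the values up to and including itself, so changing a later value never changes earlier normalized inputs. That property is what allows every position to act as a training target under a dense next-token loss without leaking future information. The names `causal_normalize` and `eps` are hypothetical.

```python
import math

def causal_normalize(series, eps=1e-8):
    """Normalize each value with the running mean/std of the values seen so far.

    Hypothetical illustration of a causal scheme; this is NOT the paper's DART-Norm.
    """
    out = []
    count, mean, m2 = 0, 0.0, 0.0  # Welford running statistics
    for x in series:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        std = math.sqrt(m2 / count)  # population std of the prefix
        out.append((x - mean) / (std + eps))
    return out

a = causal_normalize([1.0, 2.0, 3.0, 4.0])
b = causal_normalize([1.0, 2.0, 3.0, 100.0])
# The first three outputs agree: the last value cannot influence earlier ones.
assert a[:3] == b[:3]
```

A normalization based on whole-window statistics would fail the final assertion, because the outlier at the end would shift the mean and scale applied to every earlier position.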

Paper Structure

This paper contains 32 sections, 16 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Tiny-TSM’s relative MSE performance on the GIFT-Eval-NF benchmark compared to other state-of-the-art time series foundation models.
  • Figure 2: Per-prediction-length results on GIFT-Eval-NF.
  • Figure 3: The Tiny-TSM architecture: We employ a patched encoder-based architecture. First, a linear patch embedding projects patches into the model's hidden dimension. In the encoder stack, we interleave time- and feature-attention blocks at a 2:1 ratio. The internal structure of an attention block is shown on the right. The output of the encoder stack is processed by two forecast heads: A cross-attention head and a linear head, the outputs of which are added together to form the final per-patch predictions.
  • Figure 4: The Tiny-TSM pipeline. Data is first normalized using DART-Norm, then patched and consumed by the model. The model produces per-patch predictions, all of which are then de-normalized.
  • Figure 5: Overview of the SynthTS synthetic data generation and augmentation framework. We visualize a simplified example of a single series sampled from a base generator and how it is modified by univariate expansions, sparse feature mixing, and post-transforms.
  • ...and 2 more figures
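
The captions of Figures 3 and 4 describe two structural ideas that a short sketch may help picture: the input is split into patches before reaching the model, and the encoder interleaves time-attention and feature-attention blocks at a 2:1 ratio. The code below is an illustration under stated assumptions, not the authors' implementation. Non-overlapping patches, the patch length, the number of groups, and the ordering of blocks within a group are all assumptions, since the captions give only the ratio. The function names `make_patches` and `block_schedule` are hypothetical.

```python
def make_patches(series, patch_len):
    """Split a series into consecutive non-overlapping patches (assumed layout).

    Trailing values that do not fill a whole patch are dropped in this sketch.
    """
    n = len(series) // patch_len
    return [series[i * patch_len:(i + 1) * patch_len] for i in range(n)]

def block_schedule(num_groups):
    """Return block types with two time-attention blocks per feature-attention block.

    The ordering inside each group is an assumption; the caption states only the ratio.
    """
    return ["time", "time", "feature"] * num_groups

patches = make_patches(list(range(10)), patch_len=4)
schedule = block_schedule(2)
print(patches)   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(schedule)  # ['time', 'time', 'feature', 'time', 'time', 'feature']
```

In a full model, each patch would pass through the linear patch embedding and each schedule entry would correspond to one attention block. Both are omitted here to keep the sketch dependency-free.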