Tiny-TSM: Efficiently Training a Lightweight SOTA Time Series Foundation Model
Felix Birkel
TL;DR
This work tackles the challenge of building effective time series foundation models under tight compute constraints by training a 23M-parameter encoder on a single A100 GPU within a week using synthetic data from SynthTS. The approach combines a novel normalization scheme (DART-Norm) that enables dense next-token training, a patched transformer architecture with a 2:1 time-to-feature attention ratio, and inference-time strategies (SIFI, symmetry ensembling, multivariate augmentation) to achieve state-of-the-art performance on broad benchmarks without large-scale tuning. Key contributions include DART-Norm, SynthTS, SIFI, multivariate feature augmentation, and a coarse-grid loss that balances short- and long-horizon forecasting, all demonstrated on the GIFT-Eval-NF suite. In practice, this yields a resource-efficient time series foundation model that rivals larger models trained on curated data and can be deployed on modest hardware while maintaining high predictive accuracy on medium- and long-horizon tasks.
Abstract
We present Tiny-TSM, a time series foundation model characterized by small scale, economical training, and state-of-the-art performance. It comprises 23M total parameters and is trained on a single A100 GPU in less than a week using a new synthetic data generation and data augmentation pipeline (SynthTS). Without any neural architecture search, hyperparameter tuning, or scaling up of model size, Tiny-TSM achieves state-of-the-art performance on a wide range of time series benchmark datasets, often outperforming much larger models and even matching industrial-scale, likely highly tuned foundation models. Specifically, Tiny-TSM outperforms all other time series foundation models we evaluated on medium- and long-term forecasting tasks under MSE loss, while its short-term accuracy remains competitive with state-of-the-art models. We also introduce a causal input normalization scheme that enables time series models to be trained with a dense next-token prediction loss, significantly accelerating convergence and reducing training time. All experiments were conducted on a single A100 GPU, illustrating the practicality of the proposed approach in a resource-constrained setting.
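To illustrate why a causal input normalization matters for dense next-token training, the following is a minimal sketch and not the paper's DART-Norm, whose details are not given here. It assumes a simple expanding-window scheme in which each time step is normalized using only the mean and standard deviation of the values seen up to that step. The function name `causal_normalize` and the `eps` parameter are hypothetical. The point it shows is that, unlike normalization with whole-window statistics, changing future values leaves earlier normalized inputs untouched, so a loss can plausibly be applied at every position without leaking future information through the normalization.

```python
import numpy as np

def causal_normalize(x, eps=1e-8):
    """Expanding-window normalization (illustrative assumption, not DART-Norm).

    Position t is scaled with the mean and standard deviation of x[0..t] only,
    so no statistic depends on values that lie in the future of t.
    """
    x = np.asarray(x, dtype=float)
    n = np.arange(1, x.size + 1)            # prefix lengths 1, 2, ..., T
    mean = np.cumsum(x) / n                 # running mean of each prefix
    # Population variance of each prefix, clipped at zero for numerical safety.
    var = np.maximum(np.cumsum(x * x) / n - mean * mean, 0.0)
    std = np.sqrt(var) + eps
    return (x - mean) / std, mean, std

series = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z, mean, std = causal_normalize(series)

# Perturb only the "future" (positions 3 and 4) and normalize again.
perturbed = series.copy()
perturbed[3:] = [100.0, -50.0]
z_perturbed, _, _ = causal_normalize(perturbed)

# Causal scheme: the first three normalized values are unaffected.
causal_prefix_unchanged = np.allclose(z[:3], z_perturbed[:3])

# Whole-window scheme: the same perturbation shifts every normalized value.
global_z = (series - series.mean()) / series.std()
global_z_perturbed = (perturbed - perturbed.mean()) / perturbed.std()
global_prefix_unchanged = np.allclose(global_z[:3], global_z_perturbed[:3])
```

In this sketch the very first position always normalizes to zero because its prefix has no spread. A practical scheme would need some handling for such short prefixes, which is left out here.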
