
Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling

Yong Liu, Xingjian Su, Shiyu Wang, Haoran Zhang, Haixuan Liu, Yuxuan Wang, Zhou Ye, Yang Xiang, Jianmin Wang, Mingsheng Long

TL;DR

Timer-S1 achieves state-of-the-art forecasting performance, attaining the best MASE and CRPS scores as a pre-trained model, and pioneers a post-training stage, including continued pre-training and long-context extension, to enhance short-term and long-context performance.

Abstract

We introduce Timer-S1, a strong Mixture-of-Experts (MoE) time series foundation model with 8.3B total parameters, 0.75B activated parameters for each token, and a context length of 11.5K. To overcome the scalability bottleneck in existing pre-trained time series foundation models, we perform Serial Scaling in three dimensions: model architecture, dataset, and training pipeline. Timer-S1 integrates sparse TimeMoE blocks and generic TimeSTP blocks for Serial-Token Prediction (STP), a generic training objective that adheres to the serial nature of forecasting. The proposed paradigm introduces serial computations to improve long-term predictions while avoiding costly rolling-style inference and pronounced error accumulation in the standard next-token prediction. Pursuing a high-quality and unbiased training dataset, we curate TimeBench, a corpus with one trillion time points, and apply meticulous data augmentation to mitigate predictive bias. We further pioneer a post-training stage, including continued pre-training and long-context extension, to enhance short-term and long-context performance. Evaluated on the large-scale GIFT-Eval leaderboard, Timer-S1 achieves state-of-the-art forecasting performance, attaining the best MASE and CRPS scores as a pre-trained model. Timer-S1 will be released to facilitate further research.

Paper Structure (31 sections, 16 equations, 18 figures)


Figures (18)

  • Figure 1: Forecasting into the long term accumulates uncertainty, as the prediction of each step depends on all preceding estimations, which positions time series forecasting as a serial problem liu2025serial. Parallel-forecasting models, which predict multiple future steps simultaneously, do not scale with sufficient serial computations to reliably capture the recurrent dependencies. Although autoregressive models mirror the serial nature of the task by "predicting step by step", their iterative rolling mechanism over the input still entails significant computational overhead.
  • Figure 2: The Serial Scaling of Timer-S1 is achieved by (a) serial forecasting, which efficiently produces multi-step prediction with serial computations; (b) data scaling with data augmentation applied to TimeBench liu2025sundial, a corpus of over one trillion time points; and (c) post-training that comprehensively enhances the capability of the model.
  • Figure 3: A timeline of representative time series forecasting models in recent years, ordered by the release date of each model's paper or technical report. Notably, the Timer model is a continuously developed family of time series foundation models that has shown sustained scaling in model size across its generations.
  • Figure 4: Overall architecture of Timer-S1. The input time series is re-normalized and divided into patch tokens. These patch embeddings are fed into a decoder-only Transformer. The Transformer backbone consists of a series of TimeMoE blocks, where Pre-RMSNorm and QK-Norm henry2020query are adapted for training stability, followed by a sequence of TimeSTP blocks. TimeSTP extends TimeMoE by additionally conditioning on the initial input embeddings, iteratively refining the token embeddings from the previous block, and generating shifted-by-one token predictions. All output embeddings are projected by a shared forecasting head to produce quantile predictions. Timer-S1 enables serial forecasting, where predictions of longer horizons actually undergo more serial computations in the Transformer block.
  • Figure 5: Illustration of the TimeBench dataset and the training pipeline of Timer-S1. TimeBench integrates over one trillion time points from multiple domains, processed through quality-focused preprocessing, predictability assessment, and diversity-enhancing augmentation. TimeBench is loaded in a single-series sequence format for learning univariate patterns. Timer-S1’s training follows a multi-stage design: it is first pre-trained via serial-token prediction with uniform horizon weighting; it then undergoes continued pre-training with a horizon-decay objective to enhance short-term accuracy; finally, its context length is extended from 2880 to 11520 through long-context adaptation.
  • ...and 13 more figures
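
The captions for Figures 4 and 5 mention quantile predictions, uniform horizon weighting during pre-training, and a horizon-decay objective during continued pre-training. The sketch below illustrates how such a loss could be assembled from the standard pinball (quantile) loss. It is not the authors' implementation: the geometric form of the decay, the `decay` parameter, and all function names are assumptions made for illustration, since the captions do not specify the weighting schedule.

```python
def pinball_loss(y, q_pred, q):
    """Standard pinball (quantile) loss for target y and predicted quantile q_pred at level q."""
    diff = y - q_pred
    return max(q * diff, (q - 1.0) * diff)


def horizon_weights(num_horizons, decay=1.0):
    """Normalized per-horizon weights.

    decay=1.0 gives uniform weighting; decay<1.0 gives a geometric decay that
    emphasizes short horizons. The geometric form is an assumption, not the
    schedule used in the paper.
    """
    raw = [decay ** h for h in range(num_horizons)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_quantile_loss(targets, quantile_preds, levels, weights):
    """Horizon-weighted average pinball loss.

    targets[h]: true value at horizon h.
    quantile_preds[h][k]: predicted quantile at levels[k] for horizon h.
    weights[h]: weight of horizon h (e.g. from horizon_weights).
    """
    loss = 0.0
    for h, (y, preds) in enumerate(zip(targets, quantile_preds)):
        # Average the pinball loss over all quantile levels at this horizon.
        per_horizon = sum(pinball_loss(y, p, q) for p, q in zip(preds, levels)) / len(levels)
        loss += weights[h] * per_horizon
    return loss


# Uniform weighting (pre-training style) vs. decayed weighting (assumed form).
uniform = horizon_weights(4)            # every horizon counts equally
decayed = horizon_weights(4, decay=0.5) # earlier horizons count more
```

With `decay=0.5` and four horizons, the first horizon receives just over half of the total weight, which is one simple way to bias training toward short-term accuracy.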