Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts

Xiaoming Shi; Shiyu Wang; Yuqi Nie; Dianqi Li; Zhou Ye; Qingsong Wen; Ming Jin

TL;DR

Time-MoE introduces a scalable, sparse mixture-of-experts decoder-only transformer for universal time-series forecasting, addressing limitations of dense, fixed-horizon models. Trained on Time-300B, a large, multi-domain dataset, Time-MoE scales up to 2.4B parameters with about 1.1B activated, achieving superior zero-shot and in-distribution forecasting while reducing inference costs via sparse routing. The approach combines multi-resolution forecasting heads, robust loss with load-balancing, rotary embeddings, and careful data cleaning to enable stable training and strong generalization across domains. Empirical results on six benchmarks demonstrate consistent gains over state-of-the-art baselines, validating scaling laws in time-series forecasting and highlighting practical benefits for real-world deployment and future time-series foundation model research.

Abstract

Deep learning for time series forecasting has seen significant advancements over the past decades. However, despite the success of large-scale pre-training in language and vision domains, pre-trained time series models remain limited in scale and operate at a high cost, hindering the development of larger, more capable forecasting models in real-world applications. In response, we introduce Time-MoE, a scalable and unified architecture designed to pre-train larger, more capable forecasting foundation models while reducing inference costs. By leveraging a sparse mixture-of-experts (MoE) design, Time-MoE enhances computational efficiency by activating only a subset of networks for each prediction, reducing computational load while maintaining high model capacity. This allows Time-MoE to scale effectively without a corresponding increase in inference costs. Time-MoE comprises a family of decoder-only transformer models that operate in an auto-regressive manner and support flexible forecasting horizons with varying input context lengths. We pre-trained these models on our newly introduced large-scale dataset, Time-300B, which spans more than 9 domains and encompasses over 300 billion time points. For the first time, we scaled a time series foundation model up to 2.4 billion parameters, achieving significantly improved forecasting precision. Our results validate the applicability of scaling laws for training tokens and model size in the context of time series forecasting. Our models consistently outperform dense models with the same number of activated parameters or an equivalent computation budget by a large margin. These advancements position Time-MoE as a state-of-the-art solution for tackling real-world time series forecasting challenges with superior capability, efficiency, and flexibility.
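The idea of "activating only a subset of networks for each prediction" can be illustrated with a minimal sketch of generic top-k expert routing. This is not the paper's implementation: the helper names (`make_expert`, `route_token`), the toy experts that merely scale their input, the random gate weights, and the renormalisation of the selected gate probabilities are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Hypothetical stand-in for an expert network: scales each feature."""
    return lambda token: [scale * x for x in token]


def route_token(token, gate_weights, experts, top_k=2):
    """Send one token through its top_k highest-scoring experts only."""
    # Gate logits: one dot product per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    # Indices of the top_k experts by gate probability.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalise the selected probabilities so they sum to 1
    # (an assumption of this sketch, not taken from the paper).
    norm = sum(probs[i] for i in chosen)
    output = [0.0] * len(token)
    for i in chosen:
        weight = probs[i] / norm
        expert_out = experts[i](token)  # only the chosen experts are evaluated
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, chosen


random.seed(0)
dim, num_experts = 4, 8
experts = [make_expert(0.5 + 0.1 * i) for i in range(num_experts)]
gate_weights = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_experts)]
token = [0.2, -1.0, 0.5, 0.3]

output, chosen = route_token(token, gate_weights, experts, top_k=2)
print("experts used:", chosen, "of", num_experts)
```

In this toy setting only 2 of the 8 experts run per token, so the per-token compute is a fraction of the total parameter count. This is the sense in which a sparse model can grow in capacity without a matching growth in inference cost.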

Paper Structure (39 sections, 8 equations, 11 figures, 15 tables, 2 algorithms)

Introduction
Related Work
Methodology
Time-MoE Overview
Model Training
Time-300B Dataset
Loss Function
Model Configurations and Training Details
Main Results
Zero-shot Forecasting
In-distribution Forecasting
Ablation Study
Model Architecture
Training Loss
Scalability Analysis
...and 24 more sections

Figures (11)

Figure 1: Performance overview. (Left) Comparison between Time-MoE models and state-of-the-art time series foundation models, reporting the average zero-shot performance across six benchmark datasets. (Right) Comparison of few- and zero-shot performance between Time-MoE and dense variants, with similar effective FLOPs per time series token, across the same six benchmarks.
Figure 2: The architecture of Time-MoE, which is a decoder-only model. Given an input time series of arbitrary length, ① we first tokenize it into a sequence of data points, ② which are then encoded. These tokens are processed through $N$ stacked backbone layers, primarily consisting of causal multi-head self-attention and ③ sparse temporal mixture-of-experts layers. During training, ④ we optimize forecasting heads at multiple resolutions. For model inference, Time-MoE provides forecasts of flexible length by ⑤ dynamically scheduling these heads. Details about the causal multi-head self-attention are given in the appendix on implementation details and illustrated in Figure 5.
Figure 3: Scalability analysis. (Left) Comparison of dense and sparse models in terms of training and inference costs. (Right) Average MSE for 96-horizon forecasting across six benchmarks, comparing Time-MoE and dense models, both trained from scratch with varying data sizes.
Figure 4: Gating scores for experts across different layers in the six benchmarks.
Figure 5: Causal attention layer.
...and 6 more figures
