Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts

Xu Liu, Juncheng Liu, Gerald Woo, Taha Aksu, Yuxuan Liang, Roger Zimmermann, Chenghao Liu, Silvio Savarese, Caiming Xiong, Doyen Sahoo

TL;DR

This work tackles unified training for time series foundation models, arguing that frequency-based specialization is inadequate for the non-stationary, highly variable nature of real data. It introduces Moirai-MoE, a decoder-only Transformer that uses a single input/output projection and a sparse Mixture-of-Experts (MoE) to achieve token-level, data-driven specialization, including a novel clustering-based gating mechanism. Through extensive experiments on 39 datasets, Moirai-MoE achieves state-of-the-art performance in both in-distribution and zero-shot forecasting with far fewer activated parameters than competing methods. Analyses reveal frequency-invariant representations and progressive denoising, offering insights into how token-level MoE routing interacts with time-series dynamics and periodicity.

Abstract

Time series foundation models have demonstrated impressive performance as zero-shot forecasters. However, achieving effectively unified training on time series remains an open challenge. Existing approaches introduce some level of model specialization to account for the highly heterogeneous nature of time series data. For instance, Moirai pursues unified training by employing multiple input/output projection layers, each tailored to handle time series at a specific frequency. Similarly, TimesFM maintains a frequency embedding dictionary for this purpose. We identify two major drawbacks to this human-imposed frequency-level model specialization: (1) Frequency is not a reliable indicator of the underlying patterns in time series. For example, time series with different frequencies can display similar patterns, while those with the same frequency may exhibit varied patterns. (2) Non-stationarity is an inherent property of real-world time series, leading to varied distributions even within a short context window of a single time series. Frequency-level specialization is too coarse-grained to capture this level of diversity. To address these limitations, this paper introduces Moirai-MoE, using a single input/output projection layer while delegating the modeling of diverse time series patterns to the sparse mixture of experts (MoE) within Transformers. With these designs, Moirai-MoE reduces reliance on human-defined heuristics and enables automatic token-level specialization. Extensive experiments on 39 datasets demonstrate the superiority of Moirai-MoE over existing foundation models in both in-distribution and zero-shot scenarios. Furthermore, this study conducts comprehensive model analyses to explore the inner workings of time series MoE foundation models and provides valuable insights for future research.
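The token-level specialization described above can be illustrated with a minimal sketch of sparse MoE routing for a single token. This is not the authors' implementation. It assumes a plain linear gate followed by a softmax over the selected experts (not the clustering-based gating mentioned in the TL;DR), and each expert is reduced to one tanh layer. The names `matvec` and `route_token`, the choice `top_k=2`, and the toy sizes are all hypothetical.

```python
import math
import random


def matvec(vec, mat):
    """Multiply a length-d vector by a d x m matrix stored as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def route_token(token, gate_w, experts, top_k=2):
    """Send one token embedding through its top_k experts (illustrative only).

    token:   list of d floats
    gate_w:  d x n_experts gate matrix (assumed linear gate)
    experts: n_experts matrices, each d x d (assumed one-layer experts)
    """
    # One gate score per expert for this token.
    logits = matvec(token, gate_w)
    # Keep only the top_k highest-scoring experts; the rest stay inactive.
    chosen = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
    # Softmax restricted to the chosen experts (shifted for numerical stability).
    peak = max(logits[e] for e in chosen)
    scores = [math.exp(logits[e] - peak) for e in chosen]
    total = sum(scores)
    weights = [s / total for s in scores]
    # Weighted sum of the chosen experts' outputs.
    out = [0.0] * len(experts[0][0])
    for e, w in zip(chosen, weights):
        hidden = [math.tanh(h) for h in matvec(token, experts[e])]
        out = [o + w * h for o, h in zip(out, hidden)]
    return out, chosen, weights


# Toy usage: 4-dimensional token, 8 experts, 2 of them active.
random.seed(0)
d, n_experts = 4, 8
token = [random.gauss(0, 1) for _ in range(d)]
gate_w = [[random.gauss(0, 1) for _ in range(n_experts)] for _ in range(d)]
experts = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
           for _ in range(n_experts)]
out, chosen, weights = route_token(token, gate_w, experts)
print("experts used:", chosen, "mixing weights:", weights)
```

Because only `top_k` experts run per token, the number of activated parameters stays well below the total parameter count, and two tokens from the same series can be sent to different experts.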

Paper Structure

This paper contains 38 sections, 6 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: An illustration of the challenges arising from grouping time series by frequency and imposing frequency-level model specialization: the diversity of patterns within the same frequency group, the similarity of patterns across different frequencies, and the variability of distributions within a single time series. The examples presented are derived from real time series in the Monash benchmark.
  • Figure 2: Comparison of Moirai (left) and Moirai-MoE (right).
  • Figure 3: In-distribution forecasting evaluation using 29 datasets from the Monash benchmark. Asterisks (*) mark methods whose pretraining corpora included these evaluation datasets. Aggregate MAE is reported: the MAE for each dataset is normalized by the MAE of the seasonal naive forecast, and the normalized values are combined using the geometric mean.
  • Figure 4: Ablation studies of the training objective and gating function using Moirai-MoE$_{S}$.
  • Figure 5: The first two columns compare embeddings from Moirai$_{S}$ and Moirai-MoE$_{S}$. The last two columns show the expert assignment distributions of Moirai-MoE$_{S}$ in layer 1: the x-axis corresponds to the 32 experts in a layer, and the y-axis is the proportion of tokens routed to each expert.
  • ...and 7 more figures
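
The aggregate metric described in the Figure 3 caption can be sketched as follows. This is a minimal illustration assuming point forecasts and a known season length per dataset. The function names (`seasonal_naive`, `mae`, `aggregate_normalized_mae`) and the numbers in the usage example are hypothetical and are not results from the paper.

```python
import math


def seasonal_naive(history, horizon, season):
    """Forecast by repeating the last full season of the history."""
    last = history[-season:]
    return [last[i % season] for i in range(horizon)]


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def aggregate_normalized_mae(per_dataset):
    """Geometric mean of model MAE divided by seasonal naive MAE.

    per_dataset: list of (model_mae, seasonal_naive_mae) pairs, one per dataset.
    """
    ratios = [model / baseline for model, baseline in per_dataset]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Toy usage with made-up values for two datasets.
history = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
actual = [5.0, 5.0, 7.0]
baseline_mae = mae(actual, seasonal_naive(history, horizon=3, season=3))
print(aggregate_normalized_mae([(0.5, baseline_mae), (1.2, 2.0)]))
```

A value below 1 means the model beats the seasonal naive baseline on average. The geometric mean keeps a single dataset with a large error ratio from dominating the aggregate.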