Super-Linear: A Lightweight Pretrained Mixture of Linear Experts for Time Series Forecasting

Liran Nochumsohn, Raz Marshanski, Hedi Zisling, Omri Azencot

TL;DR

Super-Linear targets efficient generalization in time series forecasting by combining frequency-specialized linear experts with a lightweight spectral gating mechanism. A diverse set of univariate linear experts is pretrained on data resampled to multiple frequencies, and a sparse router is then trained to select the relevant experts. The resulting model achieves strong zero-shot and full-shot performance while dramatically reducing parameter count and inference time. The work provides theoretical bias–variance insights for the gating mechanism and demonstrates robust generalization across diverse benchmarks and sampling rates, with interpretable expert activations linked to data frequency. Overall, Super-Linear offers a practical, scalable alternative to large Transformer-based TSF foundation models, delivering competitive accuracy with substantial gains in efficiency and interpretability.

Abstract

Time series forecasting (TSF) is critical in domains like energy, finance, healthcare, and logistics, requiring models that generalize across diverse datasets. Large pre-trained models such as Chronos and Time-MoE show strong zero-shot (ZS) performance but suffer from high computational costs. In this work, we introduce Super-Linear, a lightweight and scalable mixture-of-experts (MoE) model for general forecasting. It replaces deep architectures with simple frequency-specialized linear experts, trained on resampled data across multiple frequency regimes. A lightweight spectral gating mechanism dynamically selects relevant experts, enabling efficient, accurate forecasting. Despite its simplicity, Super-Linear demonstrates strong performance across benchmarks, while substantially improving efficiency, robustness to sampling rates, and interpretability. The implementation of Super-Linear is available at: https://github.com/azencot-group/SuperLinear
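
To make the idea in the abstract concrete, the following is a minimal NumPy sketch of a sparsely gated mixture of linear experts whose router reads the magnitude spectrum of the lookback window. It is not the authors' implementation (see the repository linked above for that). The class name `MixtureOfLinearExperts`, the helpers `spectral_features` and `top_k_gate`, the linear router, the top-k softmax gate, and the random, untrained weights are all assumptions chosen for illustration.

```python
import numpy as np


def spectral_features(x):
    """Normalized magnitude spectrum of a 1-D lookback window (DC bin dropped)."""
    x = x - x.mean()
    mag = np.abs(np.fft.rfft(x))[1:]
    total = mag.sum()
    return mag / total if total > 0 else mag


def top_k_gate(logits, k):
    """Softmax over the k largest logits; every other expert gets weight zero."""
    idx = np.argsort(logits)[-k:]
    weights = np.zeros_like(logits, dtype=float)
    e = np.exp(logits[idx] - logits[idx].max())
    weights[idx] = e / e.sum()
    return weights


class MixtureOfLinearExperts:
    """Hypothetical illustration: untrained experts and router, random weights."""

    def __init__(self, lookback, horizon, n_experts, k, seed=0):
        rng = np.random.default_rng(seed)
        self.k = k
        # Each expert is one linear map from the lookback window to the horizon.
        self.experts = rng.normal(scale=0.01, size=(n_experts, horizon, lookback))
        # Assumed router: a linear map from the spectrum to one logit per expert.
        self.router = rng.normal(scale=0.01, size=(n_experts, lookback // 2))

    def forecast(self, x):
        weights = top_k_gate(self.router @ spectral_features(x), self.k)
        active = np.flatnonzero(weights)
        # Only the selected experts are evaluated: result has shape (k, horizon).
        preds = self.experts[active] @ x
        return weights[active] @ preds, weights


# Usage: a 96-step sine window, 24-step forecast, 2 of 8 experts active.
model = MixtureOfLinearExperts(lookback=96, horizon=24, n_experts=8, k=2)
window = np.sin(2 * np.pi * np.arange(96) / 24)
y_hat, gate = model.forecast(window)
```

Because only the `k` selected experts are evaluated, the cost of a forecast in this sketch grows with `k` rather than with the total number of experts. In the two-stage scheme the paper describes, the experts would first be trained per frequency and then frozen while the router is trained; the sketch omits training entirely.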

Paper Structure

This paper contains 61 sections, 19 equations, 12 figures, 15 tables, 1 algorithm.

Figures (12)

  • Figure 1: Performance versus inference time trade-off across prominent pretrained time series foundation models (TSFMs) on the GIFT-Eval and LTSF benchmarks.
  • Figure 2: Left: Forecasting performance of linear models on 12 sine-wave datasets with varying frequencies and added random walk noise. Performance improves progressively with more experts. Right: Weight sensitivity to seasonal lags—training on datasets with different seasonality (e.g., Births vs. Electricity) leads to divergent weight structures, suboptimal when shared.
  • Figure 3: Super-Linear architecture overview. A frequency-aware gating router computes sparse scores from the input frequencies, dynamically selecting a subset of experts (1), including linear experts (2), whose predictions are combined to produce the final forecast (3).
  • Figure 4: Super-Linear training framework. Data is resampled to enrich frequency diversity. Stage 1: Each expert is trained independently on a predefined frequency $\omega_i$. Stage 2: The router and complementary layers are trained with frozen experts to enable dynamic expert selection.
  • Figure 5: GIFT-Eval performance and parameter count of Super-Linear compared to prominent foundation models in TSF. The MASE score is the geometric mean of MASE across datasets, normalized by the seasonal-naive baseline.
  • ...and 7 more figures