Scaling-laws for Large Time-series Models

Thomas D. P. Edwards, James Alvey, Justin Alsing, Nam H. Nguyen, Benjamin D. Wandelt

TL;DR

The paper addresses the absence of predictable scaling laws for time-series foundation models. It demonstrates that decoder-only large time-series models (LTMs) trained on a large, diverse univariate time-series corpus follow power-law scaling with parameter count $N_p$, dataset size $\mathfrak{D}$, and training compute $\mathfrak{C}$ across about five orders of magnitude, for $\text{MSE}$, $\text{CRPS}$, and log-likelihood. Contributions include showing robustness to architectural choices such as the number of attention heads $N_{\text{heads}}$ and only modest dependence on aspect ratio (below ~100), proposing a learnable positional encoding and a Student's-$t$ head, and documenting practical training parameters and data-balancing strategies. This work provides a foundation for resource-allocation planning in time-series foundation modeling and underscores data diversity as a critical ingredient for observing scaling laws, while outlining avenues for extending to multivariate data, longer contexts, and alternative distribution heads.

Abstract

Scaling laws for large language models (LLMs) have provided useful guidance in training ever larger models for predictable performance gains. Time series forecasting shares a similar sequential structure with language, and is amenable to large-scale transformer architectures. Here we show that foundational decoder-only time series transformer models exhibit scaling behavior analogous to that of LLMs, with architectural details (aspect ratio and number of heads) having a minimal effect over broad ranges. We assemble a large corpus of heterogeneous time series data on which to train, and establish for the first time power-law scaling with parameter count, dataset size, and training compute, spanning five orders of magnitude.
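
As a rough illustration of what "power-law scaling" means in practice, the sketch below fits a relation of the form $L(x) = (x_c / x)^{\alpha}$ by linear regression in log-log space. The functional form, the helper name `fit_power_law`, and the synthetic numbers are assumptions chosen for illustration; they are not the paper's fitting procedure or its measured exponents. The log transform also requires positive loss values, so it suits metrics such as MSE or CRPS rather than a log-likelihood that can change sign.

```python
import numpy as np


def fit_power_law(x, loss):
    """Fit loss ~ (x_c / x) ** alpha by least squares in log-log space.

    Hypothetical helper: x could be parameter count, dataset size, or compute.
    """
    log_x = np.log(np.asarray(x, dtype=float))
    log_loss = np.log(np.asarray(loss, dtype=float))
    # log(loss) = -alpha * log(x) + alpha * log(x_c), i.e. a straight line
    slope, intercept = np.polyfit(log_x, log_loss, 1)
    alpha = -slope
    x_c = np.exp(intercept / alpha)
    return alpha, x_c


# Synthetic, noise-free data for illustration only (not results from the paper)
n_params = np.logspace(3, 8, 6)        # five orders of magnitude
true_alpha, true_x_c = 0.1, 1e4        # made-up values
loss = (true_x_c / n_params) ** true_alpha

alpha, x_c = fit_power_law(n_params, loss)
print(f"alpha = {alpha:.3f}, x_c = {x_c:.3g}")
```

On real training runs the points would be noisy, and one would typically fit only the minimum loss reached at each scale, as the figure captions describe.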

Paper Structure

This paper contains 16 sections, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Test Loss Scaling Laws: Minimum MSE (left), CRPS (middle) and log-likelihood (right) in-sequence test metrics as a function of the number of parameters (top), compute (middle), and dataset size (bottom).
  • Figure 2: In-sequence test loss to forecasting: Here we show the connection between improved in-sequence test loss and forecasting performance as a function of model size. In particular, we show the true data in black with $1\sigma$ ranges for both in-sequence and forecasting predictions. It is clear that as in-sequence test loss decreases, forecasting also becomes substantially more predictive.
  • Figure 3: Importance of Learning Rate: Here we show the minimum CRPS measured on the test data as a function of the maximum learning rate reached at the end of the linear warm-up schedule. Crosses indicate that the model diverged before training was complete. There is a clear optimal maximum learning rate, which decreases as model size (number of parameters) increases.
  • Figure 4: Importance of Transformer Architecture: We show the minimum CRPS on the test set as a function of architecture choices and number of parameters. Left: Performance on the test data depends only weakly on aspect ratio below 100 but degrades significantly above 128. We therefore keep aspect ratios below 70 for all scaling runs. Right: The number of attention heads has no noticeable effect on performance for either model size tested. We fix the number of heads to four for the scaling runs.
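
Several of the captions above report CRPS (the continuous ranked probability score). For readers unfamiliar with the metric, the sketch below estimates it from forecast samples using the standard identity $\text{CRPS}(F, y) = \mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$, where $X$ and $X'$ are independent draws from the forecast distribution $F$. This is a generic sample-based estimator, not necessarily how the paper evaluates CRPS; the function name `crps_from_samples`, the Student's-$t$ forecast with 5 degrees of freedom, and the test values are all assumptions for illustration.

```python
import numpy as np


def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|.

    Hypothetical helper; lower values indicate a better probabilistic forecast.
    """
    samples = np.asarray(samples, dtype=float)
    # Average distance from forecast samples to the observed value
    accuracy_term = np.mean(np.abs(samples - y))
    # Average pairwise distance between forecast samples (spread of the forecast)
    spread_term = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return accuracy_term - spread_term


# Illustrative forecast: draws from a zero-centred Student's-t (df chosen arbitrarily)
rng = np.random.default_rng(0)
draws = rng.standard_t(df=5, size=1000)

score_near = crps_from_samples(draws, 0.0)  # observation at the forecast centre
score_far = crps_from_samples(draws, 3.0)   # observation in the tail
print(score_near, score_far)
```

For a single-sample (point) forecast the spread term vanishes and the estimate reduces to the absolute error, which is why CRPS is often described as a probabilistic generalization of mean absolute error.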