Scaling Law for Time Series Forecasting

Jingzhe Shi, Qinwei Ma, Huan Ma, Lei Li

TL;DR

This work empirically evaluates various models on a diverse set of time series forecasting datasets. The results verify that scaling laws on dataset size and model complexity hold for time series forecasting, and they validate the proposed theoretical framework, particularly regarding the influence of the look-back horizon.

Abstract

Scaling laws that reward large datasets, complex models, and enhanced data granularity have been observed in various fields of deep learning. Yet studies on time series forecasting have cast doubt on the scaling behavior of deep learning methods in this setting: while more training data improves performance, more capable models do not always outperform less capable ones, and longer input horizons may hurt performance for some models. We propose a theory of scaling laws for time series forecasting that can explain these seemingly abnormal behaviors. We take into account the impact of dataset size and model complexity, as well as time series data granularity, focusing in particular on the look-back horizon, an aspect that has been unexplored in previous theories. Furthermore, we empirically evaluate various models on a diverse set of time series forecasting datasets, which (1) verifies the validity of scaling laws on dataset size and model complexity within the realm of time series forecasting, and (2) validates our theoretical framework, particularly regarding the influence of the look-back horizon. We hope our findings may inspire new models targeting time series forecasting datasets of limited size, as well as large foundational datasets and models for time series forecasting in future work. Code for our experiments has been made public at https://github.com/JingzheShi/ScalingLawForTimeSeriesForecasting.

Paper Structure (55 sections, 55 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Data Scaling. The proposed formula $\mathrm{loss}(D)=A+B/D^\alpha$ fits well. Further comparisons with other formulas can be found in the appendix.
  • Figure 2: Width Scaling. When the model is not powerful enough, $\mathrm{loss}(W)=A+B/W^\alpha$ fits well. When data is scarce, a large model may overfit, as observed with ModernTCN on ETTm1.
  • Figure 3: Loss vs. horizon for a fixed amount of training data, across different datasets and models.
  • Figure 4: PCA results under the Channel-Independent and Instance Normalization setting (left); loss vs. horizon for a fixed amount of training data on Exchange (middle) and ETTh1 (right). The Exchange dataset has $70\%$ as many training data points as ETTh1. However, since its feature degradation is stronger, the optimal horizon ($<30$) using $100\%$ of the Exchange dataset is much smaller than the optimal horizon of the ETTh1 dataset ($>300$) with only $11\%$ of the available training data.
  • Figure 5: Data scaling behavior for iTransformer (Channel-Dependent model, left), Norm-MLP (Channel-Independent model, middle), and NLinear (CI, right).
  • ...and 5 more figures
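
The captions of Figures 1 and 2 use the same functional form, $A + B/x^\alpha$, with $x$ standing for the dataset size $D$ or the model width $W$. As a rough illustration of how such a curve can be fitted, the sketch below uses only the Python standard library. It is not the authors' fitting procedure: the function name `fit_power_law`, the grid over $\alpha$, and the synthetic numbers are all assumptions made for this example. The idea is that for a fixed $\alpha$ the model is linear in $A$ and $B$, so those two follow from ordinary least squares, and $\alpha$ is then chosen by a simple grid search.

```python
def fit_power_law(xs, losses, alphas=None):
    """Fit loss(x) = A + B / x**alpha (hypothetical helper, not from the paper).

    For a fixed alpha the model is linear in (A, B), so A and B come from
    ordinary least squares on the feature z = x**(-alpha). The alpha with the
    smallest sum of squared errors over the grid is returned.
    """
    if alphas is None:
        # Assumed search grid: 0.01, 0.02, ..., 3.00.
        alphas = [i / 100 for i in range(1, 301)]
    n = len(xs)
    y_mean = sum(losses) / n
    best = None
    for alpha in alphas:
        z = [x ** (-alpha) for x in xs]
        z_mean = sum(z) / n
        szz = sum((zi - z_mean) ** 2 for zi in z)
        if szz == 0.0:
            continue  # degenerate feature, slope is undefined
        szy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, losses))
        B = szy / szz              # least-squares slope
        A = y_mean - B * z_mean    # least-squares intercept
        sse = sum((A + B * zi - yi) ** 2 for zi, yi in zip(z, losses))
        if best is None or sse < best[0]:
            best = (sse, A, B, alpha)
    _, A, B, alpha = best
    return A, B, alpha


# Synthetic, noise-free data (made-up numbers, not results from the paper).
sizes = [100, 200, 400, 800, 1600, 3200]
true_A, true_B, true_alpha = 0.30, 5.0, 0.50
losses = [true_A + true_B / d ** true_alpha for d in sizes]

A, B, alpha = fit_power_law(sizes, losses)
print(f"A={A:.3f}, B={B:.3f}, alpha={alpha:.2f}")
```

In this form $A$ plays the role of the loss floor that remains as $x$ grows, while $B$ and $\alpha$ control how quickly the loss approaches it. With noisy measurements a finer grid or a nonlinear least-squares routine may be preferable, and the fit is only meaningful in the regime the captions describe, for example before overfitting sets in for large models on scarce data.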