Loss Shaping Constraints for Long-Term Time Series Forecasting

Ignacio Hounie, Javier Porras-Valenzuela, Alejandro Ribeiro

TL;DR

Multi-step time series forecasting methods usually optimize average performance over the forecast window, which can leave errors unevenly distributed across steps. The paper introduces loss shaping constraints, which impose a per-step upper bound on the expected loss, and pairs them with a resilient relaxation that keeps the problem feasible during training. A Primal-Dual algorithm solves both the constrained and the relaxed problems, and the paper bounds the duality gap of the resulting non-convex problem under certain conditions. Experiments with transformer-based forecasters on multiple datasets show that constraining per-step losses shapes the error distribution while keeping mean performance competitive, and that the resilient relaxation improves feasibility and generalization in many settings.
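The constrained problem described above can be written down roughly as follows. This is a sketch under assumed notation rather than the paper's exact statement: $f_\theta$ (the forecaster), $\ell_t$ (the loss at step $t$), $T$ (the window length), $\zeta$ (slack variables) and $h$ (a convex cost on the slack) are symbols introduced here for illustration; only $\boldsymbol{\epsilon}$ appears in the source.

```latex
% Loss shaping constraints: best average loss subject to a per-step bound
\begin{aligned}
\min_{\theta}\quad & \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\big[\ell_t(f_\theta(x), y)\big] \\
\text{s.t.}\quad & \mathbb{E}\big[\ell_t(f_\theta(x), y)\big] \le \epsilon_t,
\qquad t = 1,\dots,T .
\end{aligned}

% One common form of a resilient relaxation (assumed here): each bound may be
% loosened by a slack \zeta_t \ge 0, at a cost h(\zeta)
\begin{aligned}
\min_{\theta,\ \zeta \ge 0}\quad & \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\big[\ell_t(f_\theta(x), y)\big] + h(\zeta) \\
\text{s.t.}\quad & \mathbb{E}\big[\ell_t(f_\theta(x), y)\big] \le \epsilon_t + \zeta_t,
\qquad t = 1,\dots,T .
\end{aligned}
```

In the first problem, a bound $\epsilon_t$ that no model can satisfy makes the problem infeasible; the slack in the second problem is what restores feasibility.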

Abstract

Several applications in time series forecasting require predicting multiple steps ahead. Despite the vast amount of literature on the topic, both classical and recent deep-learning-based approaches have mostly focused on optimising performance averaged over the predicted window. We observe that this can lead to disparate distributions of errors across forecasting steps, especially for recent transformer architectures trained on popular forecasting benchmarks. That is, optimising performance on average can lead to undesirably large errors at specific time steps. In this work, we present a Constrained Learning approach for long-term time series forecasting that aims to find the best model in terms of average performance that respects a user-defined upper bound on the loss at each time step. We call our approach loss shaping constraints because it imposes constraints on the loss at each time step, and we leverage recent duality results to show that, despite its non-convexity, the resulting problem has a bounded duality gap. We propose a practical Primal-Dual algorithm to tackle it, and demonstrate that the proposed approach exhibits competitive average performance on time series forecasting benchmarks while shaping the distribution of errors across the predicted window.
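To make the primal-dual idea concrete, here is a minimal toy sketch, not the paper's algorithm or code. It assumes a hypothetical "forecaster" with a single shared parameter `w` that predicts the same value at every step, fixed per-step targets, and a constant bound `eps`; all names and step sizes are illustrative. The primal step is gradient descent on the Lagrangian, and the dual step is projected gradient ascent on one multiplier per forecast step.

```python
import numpy as np

def step_losses(w, targets):
    """Squared error at each forecast step for a constant forecast w."""
    return (w - targets) ** 2

def primal_dual(targets, eps, lr=0.05, dual_lr=0.05, iters=5000):
    """Toy primal-dual loop: descend on w, ascend on the multipliers."""
    w = 0.0
    lam = np.zeros_like(targets)  # one multiplier per forecast step
    T = len(targets)
    for _ in range(iters):
        # Gradient in w of the Lagrangian
        #   sum_t (1/T + lam_t) * (w - y_t)^2  -  sum_t lam_t * eps_t
        grad_w = np.sum((1.0 / T + lam) * 2.0 * (w - targets))
        w -= lr * grad_w
        # A multiplier grows while its step loss exceeds the bound and is
        # clipped at zero, so satisfied constraints carry no weight.
        lam = np.maximum(0.0, lam + dual_lr * (step_losses(w, targets) - eps))
    return w, lam

targets = np.array([0.0, 0.0, 1.0])  # the last step is the "hard" one
eps = np.full(3, 0.3)                # constant bound across the window

w_erm = targets.mean()               # minimiser of the average loss
w_con, lam = primal_dual(targets, eps)

print("ERM per-step loss:        ", step_losses(w_erm, targets))
print("Constrained per-step loss:", step_losses(w_con, targets))
print("Multipliers:              ", lam)
```

In this toy setting the average-loss solution leaves the last step with a loss above the bound, while the primal-dual solution trades a slightly higher loss on the first two steps for a last-step loss at the bound. Only the multiplier of the binding constraint ends up non-zero.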

Paper Structure (26 sections, 25 equations, 14 figures, 8 tables, 1 algorithm)

Figures (14)

  • Figure 1: Test Mean Squared Error (MSE) computed at individual time steps on the forecasting window for Autoformer (Wu et al., 2021) on exchange rate data using ERM and our approach.
  • Figure 2: Test MSE at each prediction step across different models, datasets and predictive windows. The top row shows results for the Weather dataset with a predictive window length of 96 steps, and the bottom row shows the Exchange Rate dataset with a predictive window length of 720 steps. Each column corresponds to a different architecture, and each curve to a different training algorithm: the ERM baseline and our method with a constant constraint across the window, for all models.
  • Figure 3: Box plots of percentage changes across all experiments. The left column shows MSE across experiments, and the right column shows Window STD. We segment experiments by model and prediction length. The $x$-axis of each box plot is sorted by the mean ERM MSE (better models and easier datasets to the left). The full table with the results for each ERM, constrained and resilient setting can be found in Appendix \ref{app-exp-results}.
  • Figure 4: Distribution of percentage change in test MSE and Window STD across experiment settings when comparing an ERM run with a constrained run. On average, Window STD decreases by 3.47% and MSE by 4.47%.
  • Figure 5: Two failure cases. The first row shows the training and testing errors of Autoformer on Weather data with a window length of 336. The second row shows Informer on ECL data with a window length of 720. The gray lines are the values of $\boldsymbol{\epsilon}$ used during training.
  • ...and 9 more figures
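
The quantities plotted in these figures can be computed along the following lines. This is a small sketch under stated assumptions: "Window STD" is taken here to mean the standard deviation of the per-step MSE across the forecast window, which the captions do not define; the arrays are made-up illustrative data of shape (samples, window length); and the function names are hypothetical.

```python
import numpy as np

def per_step_mse(y_true, y_pred):
    """MSE at each forecast step; inputs have shape (n_samples, window_length)."""
    return ((y_pred - y_true) ** 2).mean(axis=0)

def window_std(y_true, y_pred):
    """Standard deviation of the per-step MSE across the window (assumed definition)."""
    return per_step_mse(y_true, y_pred).std()

def percent_change(new, baseline):
    """Relative change of `new` with respect to `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# Illustrative data only: errors that grow along the forecast window.
y_true = np.zeros((4, 3))
y_pred = np.tile(np.array([1.0, 2.0, 3.0]), (4, 1))

mse_t = per_step_mse(y_true, y_pred)   # one value per forecast step
print("per-step MSE:", mse_t)
print("mean MSE:    ", mse_t.mean())
print("Window STD:  ", window_std(y_true, y_pred))
```

A negative `percent_change` between a constrained run and an ERM run would correspond to the decreases reported in Figure 4.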