Rethinking Adam for Time Series Forecasting: A Simple Heuristic to Improve Optimization under Distribution Shifts

Yuze Dong, Jinsong Wu

TL;DR

TS_Adam is a lightweight variant of Adam that removes the second-order correction from the learning rate computation. It improves adaptability to distributional drift while preserving the optimizer's core structure and requiring no additional hyperparameters.

Abstract

Time-series forecasting often faces challenges from non-stationarity, particularly distributional drift, where the data distribution evolves over time. This dynamic behavior can undermine the effectiveness of adaptive optimizers such as Adam, which are typically designed for stationary objectives. In this paper, we revisit Adam in the context of non-stationary forecasting and find that its second-order bias correction limits responsiveness to shifting loss landscapes. To address this, we propose TS_Adam, a lightweight variant that removes the second-order correction from the learning rate computation. This simple modification improves adaptability to distributional drift while preserving the optimizer's core structure and requiring no additional hyperparameters. TS_Adam integrates easily into existing models and consistently improves performance across long- and short-term forecasting tasks. On the ETT datasets with the MICN model, it achieves an average reduction of 12.8% in MSE and 5.7% in MAE compared to Adam. These results underscore the practicality and versatility of TS_Adam as an effective optimization strategy for real-world forecasting scenarios involving non-stationary data. Code is available at: https://github.com/DD-459-1/TS_Adam.
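
To make the described modification concrete, the following is a minimal scalar sketch of one optimizer update, written under assumptions rather than taken from the authors' repository. It assumes the standard Adam update from the original Adam formulation and reads "removes the second-order correction" as skipping the $1 - \beta_2^t$ bias correction of the second-moment estimate. The function name `adam_step` and the flag `correct_second_moment` are hypothetical names introduced here for illustration.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, correct_second_moment=True):
    """One scalar Adam-style update; t is the 1-based step count.

    correct_second_moment=True  -> standard Adam
    correct_second_moment=False -> assumed form of the TS_Adam idea
                                   (second-moment bias correction skipped)
    """
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad

    # The first-moment bias correction is kept in both variants.
    m_hat = m / (1.0 - beta1 ** t)

    # Assumption: the variant simply uses the raw second-moment estimate.
    v_hat = v / (1.0 - beta2 ** t) if correct_second_moment else v

    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Compare the very first update of both variants on the same gradient.
p_adam, _, _ = adam_step(0.0, 2.0, 0.0, 0.0, t=1)
p_ts, _, _ = adam_step(0.0, 2.0, 0.0, 0.0, t=1, correct_second_moment=False)
print(p_adam, p_ts)
```

Under these assumptions the two variants differ only in early training: at $t = 1$ with $\beta_2 = 0.999$ the uncorrected update is larger by roughly $1/\sqrt{1 - \beta_2} \approx 31.6$, and the gap closes as $\beta_2^t \to 0$.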

Paper Structure

This paper contains 29 sections, 2 theorems, 26 equations, 4 figures, 10 tables, 1 algorithm.

Key Result

Theorem 1

Under the above decomposition, each observation $Y_t$ follows a univariate Gaussian distribution whose mean and variance are explicit functions of time $t$. Consequently, both the first and second moments vary with $t$, rendering the process intrinsically non-stationary with an explicitly time-dependent structure.

Figures (4)

  • Figure 1: Evolution of the step size modulation term $\eta_t^{\mathrm{eff}}$ with respect to training steps under $\beta_1 \in \{0.8, 0.9, 0.95\}$, for both TS_Adam and Adam. For visualization purposes, the learning rate $\alpha$ is set to 1. Note that both optimizers asymptotically converge to $\eta_t^{\mathrm{eff}} = \alpha$ during long-term training.
  • Figure 2: Test loss curves during training for various optimizers on ETTh1 (T=720) and M4-Hourly, using MICN, PatchTST, and SegRNN. Each curve shows the mean test loss, with shaded areas indicating one standard deviation. For optimizers that stop early, a flat horizontal line denotes the final value without uncertainty shading.
  • Figure 3: Sensitivity of TS_Adam to $\beta_1$ and learning rate on ETTh1 (a) and ETTh2 (b). Each cell shows the mean ± standard deviation of the absolute MSE difference (TS_Adam − Adam), averaged over four prediction lengths.
  • Figure 4: Cumulative regret curves on ETTh1 and ETTh2 using the MICN model with prediction length 192. Each curve shows the mean regret over three runs, and the shaded area denotes $\pm 5$ standard deviations to enhance visibility. (a) ETTh1; (b) ETTh2.
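
The behavior described in the Figure 1 caption can be reproduced with a short sketch. The exact definition of $\eta_t^{\mathrm{eff}}$ is not given in this summary, so the code assumes the usual step-size factor $\alpha \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$ for Adam and $\alpha / (1 - \beta_1^t)$ for TS_Adam, with $\beta_2 = 0.999$ as an assumed default. The function name `effective_step` and its parameters are hypothetical.

```python
import math


def effective_step(t, alpha=1.0, beta1=0.9, beta2=0.999,
                   second_moment_correction=True):
    """Assumed step-size modulation term at 1-based step t.

    second_moment_correction=True  -> Adam:    alpha * sqrt(1 - beta2^t) / (1 - beta1^t)
    second_moment_correction=False -> TS_Adam: alpha / (1 - beta1^t)  (assumed form)
    """
    scale = math.sqrt(1.0 - beta2 ** t) if second_moment_correction else 1.0
    return alpha * scale / (1.0 - beta1 ** t)


# Tabulate both variants for the beta1 values named in the caption.
for beta1 in (0.8, 0.9, 0.95):
    for t in (1, 10, 100, 1000, 10000):
        adam = effective_step(t, beta1=beta1)
        ts_adam = effective_step(t, beta1=beta1, second_moment_correction=False)
        print(f"beta1={beta1} t={t} adam={adam:.4f} ts_adam={ts_adam:.4f}")
```

With these assumed definitions, both terms approach $\alpha$ as $t$ grows, which matches the asymptotic behavior stated in the caption. The TS_Adam term starts above $\alpha$ and decays, while the Adam term starts lower because of the extra $\sqrt{1 - \beta_2^t}$ factor.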

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • Theorem 2: Dynamic Regret Bound
  • proof
  • proof
  • proof