Rethinking Adam for Time Series Forecasting: A Simple Heuristic to Improve Optimization under Distribution Shifts

Yuze Dong; Jinsong Wu

TL;DR

We propose TS_Adam, a lightweight variant of Adam that removes the second-order correction from the learning rate computation. This improves adaptability to distributional drift while preserving the optimizer's core structure and requiring no additional hyperparameters.

Abstract

Time-series forecasting often faces challenges from non-stationarity, particularly distributional drift, where the data distribution evolves over time. This dynamic behavior can undermine the effectiveness of adaptive optimizers, such as Adam, which are typically designed for stationary objectives. In this paper, we revisit Adam in the context of non-stationary forecasting and find that its second-order bias correction limits responsiveness to shifting loss landscapes. To address this, we propose TS_Adam, a lightweight variant that removes the second-order correction from the learning rate computation. This simple modification improves adaptability to distributional drift while preserving the optimizer's core structure and requiring no additional hyperparameters. TS_Adam integrates easily into existing models and consistently improves performance across long- and short-term forecasting tasks. On the ETT datasets with the MICN model, it achieves an average reduction of 12.8% in MSE and 5.7% in MAE compared to Adam. These results underscore the practicality and versatility of TS_Adam as an effective optimization strategy for real-world forecasting scenarios involving non-stationary data. Code is available at: https://github.com/DD-459-1/TS_Adam.
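As a rough illustration of the modification described in the abstract, the following sketch implements one Adam-style update with an optional second-moment bias correction. It assumes that "removing the second-order correction" means dropping the $1/(1-\beta_2^t)$ factor applied to the second-moment estimate; this reading, the function name `adam_step`, and the flag `second_moment_correction` are hypothetical and are not taken from the authors' repository.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, second_moment_correction=True):
    """One Adam-style update; t is the 1-based step count.

    second_moment_correction=True  -> standard Adam.
    second_moment_correction=False -> assumed TS_Adam-style variant
                                      (an interpretation, not the official code).
    """
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad

    # First-moment bias correction is kept in both variants.
    m_hat = m / (1.0 - beta1 ** t)

    if second_moment_correction:
        v_hat = v / (1.0 - beta2 ** t)  # standard Adam
    else:
        v_hat = v                       # assumption: correction removed

    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


# Compare the very first step of both variants on a unit gradient.
p0 = np.zeros(1)
g = np.ones(1)
p_adam, _, _ = adam_step(p0, g, np.zeros(1), np.zeros(1), t=1)
p_ts, _, _ = adam_step(p0, g, np.zeros(1), np.zeros(1), t=1,
                       second_moment_correction=False)
print("Adam first step:", p_adam[0])
print("Uncorrected-variant first step:", p_ts[0])
```

Under this assumption, the uncorrected variant takes larger steps early in training, because the second-moment average starts near zero and is not rescaled; the two updates coincide once $\beta_2^t$ becomes negligible.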

Paper Structure (29 sections, 2 theorems, 26 equations, 4 figures, 10 tables, 1 algorithm)

Introduction
Related works
Impact of Temporal Non-Stationarity on Optimization
Time-Dependence as a Key Driver of Temporal Non-Stationarity
Limitations of Adam in Learning Time-Varying Functions
Proposed method
Design of TS_Adam
Computational and Memory Overhead Analysis
Convergence analysis
Experiments
Experimental Setup
Performance Comparison
Long-Term Forecasting
Short-Term Forecasting
Convergence Behavior and Practical Implications
...and 14 more sections

Key Result

Theorem 1

Under the above decomposition, each observation $Y_t$ follows a univariate Gaussian distribution whose mean and variance are explicit functions of time $t$. Consequently, both the first and second moments vary with $t$, rendering the process intrinsically non-stationary with an explicitly time-dependent structure.

Figures (4)

Figure 1: Evolution of the step size modulation term $\eta_t^{\mathrm{eff}}$ with respect to training steps under $\beta_1 \in \{0.8, 0.9, 0.95\}$, for both TS_Adam and Adam. For visualization purposes, the learning rate $\alpha$ is set to 1. Note that both optimizers asymptotically converge to $\eta_t^{\mathrm{eff}} = \alpha$ during long-term training.
Figure 2: Test loss curves during training for various optimizers on ETTh1 (T=720) and M4-Hourly, using MICN, PatchTST, and SegRNN. Each curve shows the mean test loss, with shaded areas indicating one standard deviation. For optimizers that stop early, a flat horizontal line denotes the final value without uncertainty shading.
Figure 3: Sensitivity of TS_Adam to $\beta_1$ and learning rate on ETTh1 (a) and ETTh2 (b). Each cell shows the mean ± standard deviation of the absolute MSE difference (TS_Adam − Adam), averaged over four prediction lengths.
Figure 4: Cumulative regret curves on ETTh1 and ETTh2 using the MICN model with prediction length 192. Each curve shows the mean regret over three runs, and the shaded area denotes $\pm 5$ standard deviations to enhance visibility. (a) ETTh1; (b) ETTh2.
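The quantity plotted in Figure 1 can be approximated with a short sketch. The definition of $\eta_t^{\mathrm{eff}}$ is not given in this summary, so the code assumes it is the scalar multiplier that the bias corrections place on the raw ratio $m_t/\sqrt{v_t}$: $\alpha\sqrt{1-\beta_2^t}/(1-\beta_1^t)$ for Adam and $\alpha/(1-\beta_1^t)$ for TS_Adam. The value $\beta_2 = 0.999$ is also an assumption (the common Adam default), and the function names are hypothetical.

```python
import math

def eta_eff_adam(t, alpha=1.0, beta1=0.9, beta2=0.999):
    """Assumed step-size multiplier for Adam at 1-based step t."""
    return alpha * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)

def eta_eff_ts_adam(t, alpha=1.0, beta1=0.9):
    """Assumed multiplier when the second-moment correction is dropped."""
    return alpha / (1.0 - beta1 ** t)

# Tabulate both multipliers for the beta1 values named in the caption.
for beta1 in (0.8, 0.9, 0.95):
    for t in (1, 10, 100, 1000, 10000):
        adam = eta_eff_adam(t, beta1=beta1)
        ts = eta_eff_ts_adam(t, beta1=beta1)
        print(f"beta1={beta1:<4} t={t:<5} Adam={adam:.4f} TS_Adam={ts:.4f}")
```

With these assumed formulas, both multipliers tend to $\alpha$ as $t$ grows, which is consistent with the caption's note on asymptotic convergence, while the uncorrected variant starts from a larger value that depends only on $\beta_1$.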

Theorems & Definitions (6)

Theorem 1
proof
Theorem 2: Dynamic Regret Bound
proof
proof
proof
