Selective Learning for Deep Time Series Forecasting

Yisong Fu, Zezhi Shao, Chengqing Yu, Yujie Li, Zhulin An, Qi Wang, Yongjun Xu, Fei Wang

TL;DR

Deep time-series forecasting models suffer from overfitting when trained with a uniform, per-timestep regression objective. The authors propose selective learning, a model-agnostic method that trains on a subset of timesteps selected by a dual mask: an uncertainty mask based on residual entropy and an anomaly mask based on residual lower-bound estimation. Across eight real-world datasets and multiple backbones, selective learning yields consistent improvements (e.g., up to 37.4% MSE reduction for Informer) and enhances zero-shot generalization. The approach is a practical way to improve TSF performance: it focuses learning on generalizable patterns rather than on noisy or anomalous timesteps.

Abstract

Benefiting from its high capacity for capturing complex temporal patterns, deep learning (DL) has significantly advanced time series forecasting (TSF). However, deep models tend to overfit severely because time series are inherently vulnerable to noise and anomalies. The prevailing DL paradigm optimizes all timesteps uniformly through the MSE loss and learns uncertain and anomalous timesteps indiscriminately, which ultimately results in overfitting. To address this, we propose a novel selective learning strategy for deep TSF. Specifically, selective learning selects a subset of all timesteps on which to calculate the MSE loss during optimization, guiding the model to focus on generalizable timesteps while disregarding non-generalizable ones. Our framework introduces a dual-mask mechanism to target timesteps: (1) an uncertainty mask that leverages residual entropy to filter uncertain timesteps, and (2) an anomaly mask that employs residual lower bound estimation to exclude anomalous timesteps. Extensive experiments across eight real-world datasets demonstrate that selective learning significantly improves the predictive performance of typical state-of-the-art deep models, including a 37.4% MSE reduction for Informer, 8.4% for TimesNet, and 6.5% for iTransformer.
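
The core idea of computing the MSE only on a selected subset of timesteps can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `selective_mse` and the toy arrays are hypothetical, and the two masks are taken as given boolean inputs (True means the timestep is kept). How the paper derives them from residual entropy and residual lower bound estimation is not reproduced here.

```python
import numpy as np

def selective_mse(pred, target, uncertainty_mask, anomaly_mask):
    """MSE over the timesteps kept by BOTH masks (True = keep).

    Assumption: a timestep contributes to the loss only if neither mask
    filters it out, so the two masks are combined with a logical AND.
    """
    keep = np.logical_and(uncertainty_mask, anomaly_mask)
    n_kept = int(keep.sum())
    if n_kept == 0:
        # Assumed convention: no timestep selected means zero loss.
        return 0.0
    sq_err = (pred - target) ** 2
    # Average the squared error over the kept timesteps only.
    return float(sq_err[keep].sum() / n_kept)

# Toy example: the last timestep is treated as an anomaly.
pred = np.array([1.0, 2.0, 3.0, 10.0])
target = np.array([1.0, 2.0, 4.0, 0.0])
uncertainty_mask = np.array([True, True, True, True])
anomaly_mask = np.array([True, True, True, False])

full_mse = float(np.mean((pred - target) ** 2))
masked_mse = selective_mse(pred, target, uncertainty_mask, anomaly_mask)
```

In this toy case the uniform MSE is dominated by the single large error at the last timestep, while the masked loss averages only over the three kept timesteps. In a training loop the same masking would be applied to the per-timestep loss before it is reduced to a scalar.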

Paper Structure

This paper contains 55 sections, 1 theorem, 15 equations, 6 figures, 13 tables, 1 algorithm.

Key Result

Theorem 1

The error bound between variance estimation under distinct parameters $\hat{\sigma}_t^2$ and that under identical parameters $\hat{\sigma}_t^2(\boldsymbol{\theta}_\tau)$ satisfies: where $K$ is the number of iterations per epoch, and $L_f, R, G$ are constants.

Figures (6)

  • Figure 1: Left: When optimizing the model through MSE loss, our proposed selective learning calculates MSE only on a subset of timesteps, while masking out uncertain and anomalous ones that are non-generalizable. Right: Test MSE curves of iTransformer during training on the ETTh1 dataset (prediction length $F=336$). The model exhibits severe overfitting, but this is effectively mitigated through selective learning, yielding an 8.1% reduction in test MSE with stable convergence.
  • Figure 2: (a) Overall framework of selective learning. (b) Uncertainty mask. (c) Anomaly mask.
  • Figure 3: Forecasting results under different masking ratios. The prediction length is 336.
  • Figure 4: Forecasting performance with iTransformer as backbone and various estimation models. The results are averaged from all prediction lengths.
  • Figure 5: Test MSE curve on the ETTh1 dataset. The prediction length is 336.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Theorem 1: Upper Bound for Variance Estimation Error
  • proof