Beyond Model Ranking: Predictability-Aligned Evaluation for Time Series Forecasting

Wanjin Feng, Yuan Yuan, Jingtao Ding, Yong Li

TL;DR

This work addresses evaluation skew in time-series forecasting by separating model performance from data difficulty. It introduces Spectral Coherence Predictability (SCP), a scalable, task-aligned proxy for instance difficulty, and the Linear Utilization Ratio (LUR), a frequency-resolved diagnostic of how well models exploit linearly predictable structure. In toy and real-world experiments, SCP is well calibrated against empirical errors, reveals predictability drift over time, and, through band-wise analyses, uncovers architecture-dependent strengths. The results argue for shifting from single-score leaderboards to predictability-aware evaluation, which enables fairer comparisons and deeper insight into model behavior. Overall, the framework lays groundwork for adaptive architectures and training strategies that respond to data difficulty in time-series forecasting.

Abstract

In the era of increasingly complex AI models for time series forecasting, progress is often measured by marginal improvements on benchmark leaderboards. However, this approach suffers from a fundamental flaw: standard evaluation metrics conflate a model's performance with the data's intrinsic unpredictability. To address this pressing challenge, we introduce a novel, predictability-aligned diagnostic framework grounded in spectral coherence. Our framework makes two primary contributions: the Spectral Coherence Predictability (SCP), a computationally efficient ($O(N\log N)$) and task-aligned score that quantifies the inherent difficulty of a given forecasting instance, and the Linear Utilization Ratio (LUR), a frequency-resolved diagnostic tool that precisely measures how effectively a model exploits the linearly predictable information within the data. We validate our framework's effectiveness and leverage it to reveal two core insights. First, we provide the first systematic evidence of "predictability drift", demonstrating that a task's forecasting difficulty varies sharply over time. Second, our evaluation reveals a key architectural trade-off: complex models are superior for low-predictability data, whereas linear models are highly effective on more predictable tasks. We advocate for a paradigm shift, moving beyond simplistic aggregate scores toward a more insightful, predictability-aware evaluation that fosters fairer model comparisons and a deeper understanding of model behavior.
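
To make the idea of a coherence-based difficulty score concrete, the following is a minimal sketch, not the paper's SCP definition. It estimates the magnitude-squared coherence between an input series and a target series with FFT-based averaged periodograms, then reports the share of target energy that a linear filter of the input could explain. The function names `welch_coherence` and `linear_predictable_share`, the segment length `nperseg`, and the choice of non-overlapping Hann-windowed segments are all assumptions made for illustration.

```python
import numpy as np


def welch_coherence(x, y, nperseg=64):
    """Magnitude-squared coherence from averaged, Hann-windowed segments.

    Illustrative estimator only; segmenting and windowing are assumptions.
    Returns the coherence per frequency bin and the target power spectrum.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_seg = min(len(x), len(y)) // nperseg
    if n_seg < 2:
        # With a single segment the estimate is identically 1.
        raise ValueError("need at least two segments to estimate coherence")

    window = np.hanning(nperseg)
    xs = x[: n_seg * nperseg].reshape(n_seg, nperseg)
    ys = y[: n_seg * nperseg].reshape(n_seg, nperseg)
    # Remove each segment's mean, then taper it.
    xs = (xs - xs.mean(axis=1, keepdims=True)) * window
    ys = (ys - ys.mean(axis=1, keepdims=True)) * window

    X = np.fft.rfft(xs, axis=1)
    Y = np.fft.rfft(ys, axis=1)
    s_xx = np.mean(np.abs(X) ** 2, axis=0)   # input auto-spectrum
    s_yy = np.mean(np.abs(Y) ** 2, axis=0)   # target auto-spectrum
    s_xy = np.mean(np.conj(X) * Y, axis=0)   # cross-spectrum

    coherence = np.abs(s_xy) ** 2 / (s_xx * s_yy + 1e-12)
    return coherence, s_yy


def linear_predictable_share(x, y, nperseg=64):
    """Fraction of target energy that is linearly explainable from x.

    Weights the coherence by the target spectrum; values near 1 suggest an
    easy instance, values near 0 a hard one.
    """
    coherence, s_yy = welch_coherence(x, y, nperseg)
    return float(np.sum(coherence * s_yy) / np.sum(s_yy))
```

Under these assumptions, a target that is a lightly perturbed copy of the input scores close to 1, while an independent target scores close to 0. The estimate is biased upward by roughly the reciprocal of the number of segments, so short series give less reliable scores.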

Paper Structure

This paper contains 37 sections, 43 equations, 7 figures, 6 tables, 4 algorithms.

Figures (7)

  • Figure 1: Calibration of SCP against Model Error. (a) Synthetic Validation: MSE of the best linear predictor on a synthetic Gaussian process with varying SNR. (b) Real-World Alignment: Average performance of state-of-the-art prediction models on real datasets. We report normalized MSE (NMSE), obtained by dividing MSE by the corresponding variance.
  • Figure 2: Per-variable scatter plots on Weather (a) and ECL (b) comparing the estimated MSE lower bound (MSE$_{\mathrm{lb}}$) with iTransformer’s prediction error (MSE$_{\mathrm{model}}$).
  • Figure 3: Visualizing Predictability Drift. ETTh1 test set, channel 1, horizon $N=336$. Relationship between per-sample linearly predictable energy (decomposed by frequency band as a share of total) and the corresponding MSE of DLinear and PatchTST.
  • Figure 4: Band-wise analysis on ETTh1, representative channel. Normalized energy shares and LUR across frequency bands for three models.
  • Figure 5: ETTh1 dataset with forecasting length $N=96$: per-channel evaluation stratified by predictability $\mathcal{P}$. Samples are grouped into equal-width $\mathcal{P}$ bins; each point reports the mean MSE within the bin.
  • ...and 2 more figures
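
Two of the captions above describe simple computations: the normalized MSE of Figure 1 (MSE divided by the corresponding variance) and the equal-width predictability bins of Figure 5 (mean MSE within each bin). The sketch below shows one way these could be computed. It assumes the "corresponding variance" is the variance of the target values, and the helper names `nmse` and `mean_error_by_bin` are hypothetical, not taken from the paper.

```python
import numpy as np


def nmse(y_true, y_pred):
    """MSE divided by the variance of the targets (assumed normalizer)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    return float(mse / np.var(y_true))


def mean_error_by_bin(scores, errors, n_bins=5):
    """Group samples into equal-width score bins and average their errors.

    Returns the bin centers and the mean error per bin (NaN if a bin is empty).
    """
    scores = np.asarray(scores, dtype=float)
    errors = np.asarray(errors, dtype=float)
    edges = np.linspace(scores.min(), scores.max(), n_bins + 1)
    # Interior edges only, so the maximum score falls in the last bin.
    idx = np.digitize(scores, edges[1:-1])
    centers = 0.5 * (edges[:-1] + edges[1:])
    means = np.array([
        errors[idx == b].mean() if np.any(idx == b) else np.nan
        for b in range(n_bins)
    ])
    return centers, means
```

With this normalizer, a predictor that always outputs the target mean has an NMSE of exactly 1, which gives a reference point when reading normalized errors across datasets with different scales.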