Are We Winning the Wrong Game? Revisiting Evaluation Practices for Long-Term Time Series Forecasting

Thanapol Phungtua-eng; Yoshitaka Yamamoto

TL;DR

A multi-dimensional evaluation perspective is proposed that integrates statistical fidelity, structural coherence, and decision-level relevance in long-term time series forecasting, and redirects attention from winning benchmark tables toward advancing meaningful, context-aware forecasting.

Abstract

Long-term time series forecasting (LTSF) is widely recognized as a central challenge in data mining and machine learning. LTSF has increasingly evolved into a benchmark-driven "GAME," where models are ranked, compared, and declared state-of-the-art based primarily on marginal reductions in aggregated pointwise error metrics such as MSE and MAE. Across a small set of canonical datasets and fixed forecasting horizons, progress is communicated through leaderboard-style tables in which lower numerical scores define success. In this GAME, what is measured becomes what is optimized, and incremental error reduction becomes the dominant currency of advancement. We argue that this metric-centric regime is not merely incomplete, but structurally misaligned with the broader objectives of forecasting. In real-world settings, forecasting often prioritizes preserving temporal structure, trend stability, seasonal coherence, robustness to regime shifts, and supporting downstream decision processes. Optimizing aggregate pointwise error does not necessarily imply modeling these structural properties. As a result, leaderboard improvement may increasingly reflect specialization in benchmark configurations rather than a deeper understanding of temporal dynamics. This paper revisits LTSF evaluation as a foundational question in data science: what does it mean to measure forecasting progress? We propose a multi-dimensional evaluation perspective that integrates statistical fidelity, structural coherence, and decision-level relevance. By challenging the current metric monoculture, we aim to redirect attention from winning benchmark tables toward advancing meaningful, context-aware forecasting.

Paper Structure (15 sections, 3 figures, 1 table)

This paper contains 15 sections, 3 figures, 1 table.

When Benchmark Success Becomes the Objective of Long-Term Time Series Forecasting
A Big and Bold Question: Is the Evaluation Game Aligned with Forecasting Objectives?
When Evaluation Shapes Objectives
Why This Question Matters Now
Benchmark-Driven Model Development
One Dataset, Multiple Interpretations
Incentive Effects and Metric Dominance (Forecasting $\neq$ Curve Fitting)
How Does This Paper Push the Frontier?
Beyond Pointwise Error Metrics
From Leaderboard Ranking to Diagnostic Reporting
No Universal Champion: Domain-Specific and Context-Dependent Performance
What Would Success Look Like?
Supplementary Materials
Implementation Details
Window-Level Diagnostics

Figures (3)

Figure 1: Forecasting results for the oil temperature (OT) series in the ETTm2 dataset using the DLinear model (Zeng et al., 2023). The figure shows three representative sliding windows: the highest MSE (Worst), a window closest to the mean MSE (Closest), and the lowest MSE (Best). For each window, recent input observations before the cutoff are plotted together with the ground truth and forecasted values over the prediction horizon. The figure highlights the variability of forecasting performance across windows that may be hidden by aggregated error metrics.
Figure 2: Forecasting results for the oil temperature (OT) series in the ETTm2 dataset across models. The figure shows three representative sliding windows: the highest MSE (Worst), a window closest to the mean MSE (Closest), and the lowest MSE (Best). For each window, recent input observations before the cutoff are plotted together with the ground truth and forecasted values over the prediction horizon. The figure highlights the variability of forecasting performance across windows that may be hidden by aggregated error metrics.
Figure 3: Forecasting results for all variables in the ETTm2 dataset across models. The figure shows three representative sliding windows: the highest MSE (Worst), a window closest to the mean MSE (Closest), and the lowest MSE (Best). For each window, recent input observations before the cutoff are plotted together with the ground truth and forecasted values over the prediction horizon. The figure highlights the variability of forecasting performance across windows that may be hidden by aggregated error metrics.
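
The window selection described in the figure captions (Worst, Closest, Best) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `window_mse` and `representative_windows` are hypothetical, the toy numbers are invented and unrelated to ETTm2, and tie-breaking simply takes the first matching window.

```python
from statistics import fmean


def window_mse(y_true, y_pred):
    """Mean squared error over a single forecast window."""
    return fmean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def representative_windows(truths, forecasts):
    """Pick the worst, closest-to-mean, and best windows by per-window MSE.

    truths, forecasts: lists of equal-length sequences, one pair per
    sliding window (hypothetical input layout, assumed for this sketch).
    """
    scores = [window_mse(t, f) for t, f in zip(truths, forecasts)]
    mean_score = fmean(scores)  # the aggregate figure a leaderboard would report
    indices = range(len(scores))
    return {
        "worst": max(indices, key=scores.__getitem__),
        "closest": min(indices, key=lambda i: abs(scores[i] - mean_score)),
        "best": min(indices, key=scores.__getitem__),
        "mean_mse": mean_score,
        "scores": scores,
    }


# Toy data, invented for illustration only (not taken from ETTm2).
truths = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0], [4.0, 5.0, 6.0]]
forecasts = [[1.0, 2.0, 3.0], [2.0, 3.0, 6.0], [0.0, 1.0, 2.0], [4.0, 6.0, 6.0]]

result = representative_windows(truths, forecasts)
print(result["worst"], result["closest"], result["best"])
```

In this toy case a single window accounts for most of the mean MSE, which is the kind of per-window variability that an aggregated score alone would not reveal.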
