A Controlled Comparison of Deep Learning Architectures for Multi-Horizon Financial Forecasting: Evidence from 918 Experiments

Nabeel Ahmad Saidd

Abstract

Multi-horizon price forecasting is central to portfolio allocation, risk management, and algorithmic trading, yet deep learning architectures have proliferated faster than rigorous financial benchmarks can evaluate them. This study provides a controlled comparison of nine architectures (Autoformer, DLinear, iTransformer, LSTM, ModernTCN, N-HiTS, PatchTST, TimesNet, and TimeXer) spanning Transformer, MLP, CNN, and RNN families across cryptocurrency, forex, and equity index markets at 4-hour and 24-hour horizons. A total of 918 experiments were conducted under a strict five-stage protocol including fixed-seed Bayesian hyperparameter optimization, configuration freezing per asset class, multi-seed retraining, uncertainty aggregation, and statistical validation. ModernTCN achieves the best mean rank (1.333) with a 75 percent first-place rate, followed by PatchTST (2.000). Results reveal a clear three-tier ranking structure and show that architecture explains nearly all performance variance, while seed randomness is negligible. Rankings remain stable across horizons despite 2 to 2.5 times error amplification. Directional accuracy remains near 50 percent across all configurations, indicating that MSE-trained models lack directional skill at hourly resolution. The findings highlight the importance of architectural inductive bias over raw parameter count and provide reproducible guidance for multi-step financial forecasting.
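
The two leaderboard statistics quoted above, mean rank and first-place rate, can be illustrated with a minimal sketch. It assumes that models are ranked within each evaluation point by an error metric such as RMSE (rank 1 = lowest error) and that ties do not occur; the function names `rank_models` and `leaderboard`, the model labels, and the toy error values are hypothetical and are not taken from the study.

```python
from typing import Dict, List


def rank_models(errors: Dict[str, float]) -> Dict[str, int]:
    """Rank models within one evaluation point; rank 1 = lowest error.

    Assumption: no ties between models (ties are not handled here).
    """
    ordered = sorted(errors, key=errors.get)
    return {model: position + 1 for position, model in enumerate(ordered)}


def leaderboard(points: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Aggregate per-point ranks into mean rank and first-place rate."""
    per_point_ranks = [rank_models(p) for p in points]
    n_points = len(per_point_ranks)
    summary = {}
    for model in points[0]:
        ranks = [r[model] for r in per_point_ranks]
        summary[model] = {
            "mean_rank": sum(ranks) / n_points,
            "first_place_rate": sum(rank == 1 for rank in ranks) / n_points,
        }
    return summary


# Synthetic errors for three hypothetical models at four evaluation points.
toy_points = [
    {"model_a": 0.10, "model_b": 0.20, "model_c": 0.30},
    {"model_a": 0.20, "model_b": 0.10, "model_c": 0.30},
    {"model_a": 0.10, "model_b": 0.30, "model_c": 0.20},
    {"model_a": 0.10, "model_b": 0.20, "model_c": 0.40},
]

toy_board = leaderboard(toy_points)
for name, stats in sorted(toy_board.items(), key=lambda kv: kv[1]["mean_rank"]):
    print(name, stats)
```

In the study the same kind of aggregation would run over 24 evaluation points (12 assets $\times$ 2 horizons) and nine models; a rank-based summary of this kind is insensitive to the very different price scales of the three asset classes, which is one plausible reason for reporting ranks alongside raw errors.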

Paper Structure (130 sections, 2 equations, 39 figures, 19 tables)

Figures (39)

  • Figure 1: Representative hourly Close-price time series for one asset per class: BTC/USDT (cryptocurrency), EUR/USD (forex), and Dow Jones (equity indices). Vertical dashed lines indicate chronological train/validation/test boundaries (70/15/15 split). The three classes exhibit qualitatively different dynamics: high-volatility trending behaviour (cryptocurrency), low-volatility mean-reversion around a narrow range (forex), and moderate-volatility upward drift (equity indices). All series comprise the most recent 30,000 hourly observations.
  • Figure 2: Five-stage experimental pipeline. Stage 1: Fixed-seed Bayesian HPO on representative assets (BTC/USDT, EUR/USD, Dow Jones; seed 42; 5 Optuna TPE trials; 50 epochs per trial). Stage 2: Best configuration frozen per (model, category, horizon) triple. Stage 3: Multi-seed final training (seeds 123, 456, 789; 100 epochs maximum; early stopping with patience 15). Stage 4: Test-set metric aggregation with inverse scaling (mean $\pm$ std across seeds). Stage 5: Benchmarking with rank-based leaderboard analysis, visualisation, and variance decomposition. All 918 experimental runs---270 HPO trials plus 648 final training runs---are conducted under identical conditions.
  • Figure 3: Global RMSE heatmap across eight modern architectures and 24 evaluation points (12 assets $\times$ 2 horizons). Lighter cells indicate lower error. ModernTCN and PatchTST consistently achieve the lowest RMSE values across all asset--horizon combinations. LSTM is excluded for visual clarity; the full nine-model variant is provided in Appendix \ref{sec:app_dual_plots}, Figure \ref{fig:app_global_heatmap_all}. Values represent mean RMSE across three seeds.
  • Figure 4: Global mean rank comparison across 24 evaluation points (12 assets $\times$ 2 horizons). Lower values indicate better performance. Three distinct tiers are visible: ModernTCN and PatchTST (ranks 1--2), a middle group of four models (ranks 3--6), and a bottom group comprising TimesNet, Autoformer, and LSTM (ranks 7--9). Error bars represent rank standard deviation across evaluation points.
  • Figure 5: Category-level rank distributions across assets within each category, excluding LSTM for visual clarity. ModernTCN exhibits the tightest rank distribution (consistently rank 1 across all categories), indicating stable cross-asset performance. The full nine-model variant is provided in Appendix \ref{sec:app_dual_plots}. Boxes show interquartile range; whiskers extend to the most extreme rank observed.
  • ...and 34 more figures
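
The run counts in the Figure 2 caption (270 HPO trials plus 648 final training runs, 918 in total) can be reproduced with a short arithmetic sketch. The factorisation below is inferred from the caption and the abstract rather than stated explicitly in the source: it assumes HPO runs once per (model, category, horizon) triple with 5 trials each, and that final training covers every (model, asset, horizon) combination with 3 seeds. The constant and function names are hypothetical.

```python
# Counts taken from the abstract and the Figure 1-3 captions.
N_MODELS = 9        # Autoformer, DLinear, iTransformer, LSTM, ModernTCN, ...
N_CATEGORIES = 3    # cryptocurrency, forex, equity indices
N_ASSETS = 12       # 12 assets x 2 horizons = 24 evaluation points
N_HORIZONS = 2      # 4-hour and 24-hour
N_HPO_TRIALS = 5    # Optuna TPE trials per HPO study
N_SEEDS = 3         # seeds 123, 456, 789


def hpo_trial_count() -> int:
    """Assumed: one HPO study per (model, category, horizon) triple."""
    return N_MODELS * N_CATEGORIES * N_HORIZONS * N_HPO_TRIALS


def final_run_count() -> int:
    """Assumed: one final run per (model, asset, horizon, seed)."""
    return N_MODELS * N_ASSETS * N_HORIZONS * N_SEEDS


total_runs = hpo_trial_count() + final_run_count()
print(hpo_trial_count(), final_run_count(), total_runs)
```

Under these assumptions the products match the caption: 9 $\times$ 3 $\times$ 2 $\times$ 5 = 270 and 9 $\times$ 12 $\times$ 2 $\times$ 3 = 648, which sum to 918.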