Deep Learning for Financial Time Series: A Large-Scale Benchmark of Risk-Adjusted Performance

Adir Saly-Kaufmann, Kieran Wood, Jan Peter-Calliess, Stefan Zohren

Abstract

We present a large-scale benchmark of modern deep learning architectures for a financial time series prediction and position sizing task, with a primary focus on Sharpe ratio optimization. Evaluating linear models, recurrent networks, transformer-based architectures, state space models, and recent sequence representation approaches, we assess out-of-sample performance on a daily futures dataset covering commodities, equity indices, bonds, and FX from 2010 to 2025. Our evaluation goes beyond average returns and includes statistical significance, downside and tail risk measures, breakeven transaction cost analysis, robustness to random seed selection, and computational efficiency. We find that models explicitly designed to learn rich temporal representations consistently outperform linear benchmarks and generic deep learning models, which often lead the rankings in standard time series benchmarks. The hybrid model VSN with LSTM, a combination of Variable Selection Networks (VSN) and LSTMs, achieves the highest overall Sharpe ratio, while VSN with xLSTM and LSTM with PatchTST exhibit superior downside-adjusted characteristics. xLSTM demonstrates the largest breakeven transaction cost buffer, indicating improved robustness to trading frictions.

Paper Structure

This paper contains 98 sections, 49 equations, 27 figures, 11 tables.

Figures (27)

  • Figure 1: End-to-end portfolio optimization pipeline: Statistical and technical indicators are extracted from historical close prices, serving as the predictive model's inputs. The model outputs are transformed into portfolio weights via a linear projection followed by a hyperbolic tangent activation. Training is performed by minimizing the negative Sharpe Ratio.
  • Figure 2: Performance comparison across models: gross PnL rescaled to 10% volatility.
  • Figure 3: Distribution of daily returns. To make the central mass visible, the figure focuses on the bulk of the distribution; tail behavior is examined separately.
  • Figure 4: Distribution of daily realized volatility (log scale). Volatility exhibits strong right skewness and a long upper tail.
  • Figure 5: Left: Quantile-quantile plot against the normal distribution. Right: Tail behavior of daily returns. The figures indicate substantial deviations from Gaussianity and heavy-tailed return dynamics.
  • ...and 22 more figures
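
The output stage and training objective described in the Figure 1 caption (a linear projection, a hyperbolic tangent activation, and a negative Sharpe ratio loss) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the array shapes, the equal-weight averaging across assets, and the 252-day annualization factor are all assumptions made here for the example.

```python
import numpy as np


def positions_from_features(hidden, w, b):
    """Map model outputs to positions in (-1, 1).

    hidden: array of shape (T, N, H), model outputs per day and asset (assumed layout).
    w:      array of shape (H,), weights of the linear projection.
    b:      scalar bias of the linear projection.
    """
    # Linear projection followed by tanh, as in the Figure 1 caption.
    return np.tanh(hidden @ w + b)


def negative_sharpe(positions, next_returns, periods_per_year=252, eps=1e-8):
    """Negative annualized Sharpe ratio of the resulting portfolio.

    positions, next_returns: arrays of shape (T, N).
    Assumption: the portfolio return is the equal-weight average over assets,
    and 252 trading days per year are used for annualization.
    """
    portfolio = (positions * next_returns).mean(axis=1)
    sharpe = np.sqrt(periods_per_year) * portfolio.mean() / (portfolio.std() + eps)
    # Minimizing this value maximizes the Sharpe ratio.
    return -sharpe


# Synthetic example: 500 days, 4 assets, 8 hidden features.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(500, 4, 8))
w = rng.normal(size=8)
next_returns = rng.normal(scale=0.01, size=(500, 4))

positions = positions_from_features(hidden, w, b=0.0)
loss = negative_sharpe(positions, next_returns)
```

In a training loop the same loss would be written in an autodiff framework so that gradients flow back through the tanh layer into the model; the NumPy version only shows the arithmetic.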