Same Error, Different Function: The Optimizer as an Implicit Prior in Financial Time Series

Federico Vittorio Cortesi, Giuseppe Iannone, Giulia Crippa, Tomaso Poggio, Pierfrancesco Beneventano

TL;DR

Using large-scale volatility forecasting for S&P 500 stocks, the paper shows that different model-training-pipeline pairs with identical test loss learn qualitatively different functions, so model evaluation should extend beyond scalar loss to encompass functional and decision-level implications.

Abstract

Neural networks applied to financial time series operate in a regime of underspecification, where model predictors achieve indistinguishable out-of-sample error. Using large-scale volatility forecasting for S&P 500 stocks, we show that different model-training-pipeline pairs with identical test loss learn qualitatively different functions. Across architectures, predictive accuracy remains unchanged, yet optimizer choice reshapes non-linear response profiles and temporal dependence differently. These divergences have material consequences for decisions: volatility-ranked portfolios trace a near-vertical Sharpe-turnover frontier, with nearly $3\times$ turnover dispersion at comparable Sharpe ratios. We conclude that in underspecified settings, optimization acts as a consequential source of inductive bias, so model evaluation should extend beyond scalar loss to encompass functional and decision-level implications.
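To make the Sharpe-turnover comparison concrete, the following is a minimal sketch of how both quantities could be computed for a volatility-ranked portfolio. The portfolio rule (equal-weighting the lowest-predicted-volatility fraction of stocks), the annualization factor of 252, and all function names are assumptions for illustration; the abstract does not specify the construction the authors use.

```python
import numpy as np

def low_vol_weights(pred_vol, frac=0.2):
    """Equal-weight the `frac` of assets with the lowest predicted volatility.

    pred_vol has shape (n_dates, n_assets). The selection rule is a
    hypothetical choice, not taken from the paper.
    """
    n_dates, n_assets = pred_vol.shape
    k = max(1, int(round(frac * n_assets)))
    weights = np.zeros_like(pred_vol, dtype=float)
    lowest = np.argsort(pred_vol, axis=1)[:, :k]  # indices of the k lowest forecasts per date
    np.put_along_axis(weights, lowest, 1.0 / k, axis=1)
    return weights

def sharpe_ratio(weights, returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming zero risk-free rate.

    Row t of `returns` is assumed to be the return earned while
    holding row t of `weights`.
    """
    pnl = (weights * returns).sum(axis=1)
    return np.sqrt(periods_per_year) * pnl.mean() / pnl.std(ddof=1)

def mean_turnover(weights):
    """Average one-way turnover: half the L1 change in weights between dates."""
    return 0.5 * np.abs(np.diff(weights, axis=0)).sum(axis=1).mean()
```

Under these assumptions, two forecasters with the same test loss can be compared by passing each model's forecasts through `low_vol_weights` and plotting `sharpe_ratio` against `mean_turnover`; a near-vertical frontier would correspond to similar Sharpe values at widely different turnover.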

Paper Structure (86 sections, 14 equations, 24 figures, 3 tables)

Figures (24)

  • Figure 1: LSTM Response Surface. Visualization of the impulse response $\mathcal{R}(k, \delta)$. The LSTM (trained with Muon) exhibits a curved decision boundary, indicating distinct non-linear sensitivity to volatility shocks at specific lags.
  • Figure 2: Functional Divergence: Adaptive vs. SGD. Impulse response analysis of the CNN architecture at lag $t-1$. The optimizer dictates the complexity of the learned function: Adam and Muon (blue/red) identify a complex non-linear dampening mechanism, whereas SGD (green) reverts to a distinctly different function. All models achieve comparable predictive error, yet represent fundamentally different functional interpretations of the same data.
  • Figure 3: Functional Divergence Across Architectures. The difference surface $D = \hat{y}_{\text{Muon}} - \hat{y}_{\text{Adam}}$ plotted for LSTM (left), CNN (middle), and Transformer (right). All three architectures produce complex, non-flat difference landscapes, confirming that Adam and Muon settle into fundamentally different local minima regardless of the specific architecture used.
  • Figure 4: Mechanism of Divergence (SHAP Values). Feature attribution analysis comparing Adam and Muon optimizers across lags $t-1$ (Feature 99) to $t-100$ (Feature 0).
  • Figure 5: Edge of Stability Trace under SGD. Evolution of the maximum Hessian eigenvalue $\lambda_{max}$ during SGD training for the MLP architecture relative to the stability threshold $2/\eta$. Sharpness rises until it equilibrates at the edge of instability. Financial neural networks exhibit EoS behavior where sharpness tracks the stability limit.
  • ...and 19 more figures
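
The impulse response $\mathcal{R}(k, \delta)$ of Figure 1 and the difference surface $D$ of Figure 3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: $\mathcal{R}(k, \delta)$ is taken here to be the change in the forecast when a shock $\delta$ is added to the input at lag $t-k$, and `predict`, `x_base`, and the function names are hypothetical. The indexing follows the Figure 4 caption, where the last feature is lag $t-1$ and feature 0 is lag $t-100$.

```python
import numpy as np

def prediction_surface(predict, x_base, lags, deltas):
    """Forecasts on a grid of single-lag shocks.

    predict: callable mapping an array of shape (n, n_lags) to n forecasts
             (a stand-in for any trained model).
    x_base:  1-D baseline input window; the last entry is lag t-1.
    """
    x_base = np.asarray(x_base, dtype=float)
    surface = np.empty((len(lags), len(deltas)))
    for i, k in enumerate(lags):
        for j, delta in enumerate(deltas):
            x = x_base.copy()
            x[-k] += delta  # lag t-k sits k positions from the end
            surface[i, j] = predict(x[None, :])[0]
    return surface

def impulse_response(predict, x_base, lags, deltas):
    """Assumed definition: R(k, delta) = f(x + delta * e_k) - f(x)."""
    x_base = np.asarray(x_base, dtype=float)
    baseline = predict(x_base[None, :])[0]
    return prediction_surface(predict, x_base, lags, deltas) - baseline

def difference_surface(predict_a, predict_b, x_base, lags, deltas):
    """D = y_hat_a - y_hat_b evaluated on the same shock grid."""
    return (prediction_surface(predict_a, x_base, lags, deltas)
            - prediction_surface(predict_b, x_base, lags, deltas))
```

With two models trained by different optimizers passed as `predict_a` and `predict_b`, a flat `difference_surface` would indicate that they respond identically to shocks, while a non-flat one corresponds to the divergence described in Figure 3.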