Same Error, Different Function: The Optimizer as an Implicit Prior in Financial Time Series

Federico Vittorio Cortesi; Giuseppe Iannone; Giulia Crippa; Tomaso Poggio; Pierfrancesco Beneventano

TL;DR

Using large-scale volatility forecasting for S&P 500 stocks, the paper shows that different model-training-pipeline pairs with identical test loss learn qualitatively different functions, so model evaluation should extend beyond scalar loss to encompass functional and decision-level implications.

Abstract

Neural networks applied to financial time series operate in a regime of underspecification, where distinct predictors achieve indistinguishable out-of-sample error. Using large-scale volatility forecasting for S&P 500 stocks, we show that different model-training-pipeline pairs with identical test loss learn qualitatively different functions. Across architectures, predictive accuracy remains unchanged, yet optimizer choice reshapes both non-linear response profiles and temporal dependence. These divergences have material consequences for decisions: volatility-ranked portfolios trace a near-vertical Sharpe-turnover frontier, with nearly $3\times$ turnover dispersion at comparable Sharpe ratios. We conclude that in underspecified settings, optimization acts as a consequential source of inductive bias, so model evaluation should extend beyond scalar loss to encompass functional and decision-level implications.

Paper Structure (86 sections, 14 equations, 24 figures, 3 tables)

Introduction
Model leaderboard ties.
Optimizer choice appears inconsequential.
Overview and takeaways.
Detailed contributions.
Experimental Framework
Task Definition: Volatility Forecasting
Model--Optimizer Pairs
Experimental grid.
Functional Diagnostics Beyond Error Metrics
Impulse Response Analysis.
Functional Difference Surfaces.
Feature Attribution via SHAP.
Orthogonality via Ensembling.
The Phenomenon of Predictive Equivalence
...and 71 more sections

Figures (24)

Figure 1: LSTM Response Surface. Visualization of the impulse response $\mathcal{R}(k, \delta)$. The LSTM (trained with Muon) exhibits a curved decision boundary, indicating distinct non-linear sensitivity to volatility shocks at specific lags.
Figure 2: Functional Divergence: Adaptive vs. SGD. Impulse response analysis of the CNN architecture at lag $t-1$. The optimizer dictates the complexity of the learned function: Adam and Muon (blue/red) identify a complex non-linear dampening mechanism, whereas SGD (green) reverts to a distinctly different function. All models achieve comparable predictive error, yet represent fundamentally different functional interpretations of the same data.
Figure 3: Functional Divergence Across Architectures. The difference surface $D = \hat{y}_{\text{Muon}} - \hat{y}_{\text{Adam}}$ plotted for LSTM (left), CNN (middle), and Transformer (right). All three architectures produce complex, non-flat difference landscapes, confirming that Adam and Muon settle into fundamentally different local minima regardless of the specific architecture used.
Figure 4: Mechanism of Divergence (SHAP Values). Feature attribution analysis comparing Adam and Muon optimizers across lags $t-1$ (Feature 99) to $t-100$ (Feature 0).
Figure 5: Edge of Stability Trace under SGD. Evolution of the maximum Hessian eigenvalue $\lambda_{\max}$ during SGD training for the MLP architecture, relative to the stability threshold $2/\eta$. Sharpness rises until it equilibrates at the edge of instability. Financial neural networks exhibit Edge of Stability (EoS) behavior, where sharpness tracks the stability limit.
...and 19 more figures
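
The impulse response $\mathcal{R}(k, \delta)$ of Figure 1 and the difference surface $D$ of Figure 3 can be illustrated with a minimal sketch. The captions do not give the exact definitions, so the code assumes $\mathcal{R}(k, \delta) = f(x + \delta e_k) - f(x)$ for a baseline window $x$ and a shock of size $\delta$ at feature index $k$, and takes $D$ as the difference between two models' responses on the same grid. The names `impulse_response`, `model_a`, and `model_b` are hypothetical; the two toy predictors stand in for trained networks and are not the paper's models.

```python
import numpy as np

def impulse_response(predict, baseline, lags, deltas):
    """Assumed definition: R[i, j] = predict(x + deltas[j] * e_k) - predict(x), k = lags[i]."""
    base = predict(baseline[None, :])[0]
    surface = np.empty((len(lags), len(deltas)))
    for i, k in enumerate(lags):
        for j, d in enumerate(deltas):
            shocked = baseline.copy()
            shocked[k] += d  # perturb a single lag of the input window
            surface[i, j] = predict(shocked[None, :])[0] - base
    return surface

# Toy setup: 100-lag window, feature 99 = lag t-1, feature 0 = lag t-100
# (the indexing convention stated in the Figure 4 caption).
window = 100
weights = np.exp(-np.arange(window)[::-1] / 10.0)  # recent lags weigh more
weights /= weights.sum()

def model_a(x):
    """Hypothetical linear predictor."""
    return x @ weights

def model_b(x):
    """Hypothetical predictor that saturates for large shocks."""
    return np.tanh(x) @ weights

baseline = np.zeros(window)
lags = [99, 98, 90]             # feature indices to shock
deltas = np.linspace(-2, 2, 9)  # shock magnitudes

R_a = impulse_response(model_a, baseline, lags, deltas)
R_b = impulse_response(model_b, baseline, lags, deltas)
D = R_b - R_a  # difference surface between the two predictors
```

In this toy case the two predictors agree closely for small shocks and diverge for large ones, so `D` is flat near $\delta = 0$ and non-zero at the edges of the grid. With trained networks, `predict` would wrap the model's forward pass and `baseline` would be a representative input window.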
