Large (and Deep) Factor Models

Bryan Kelly, Boris Kuznetsov, Semyon Malamud, Teng Andrea Xu

TL;DR

The paper establishes a theoretical bridge between wide, deep neural networks trained to maximize the Sharpe ratio of the stochastic discount factor and large factor models, via Neural Tangent Kernel theory. It shows that, in the infinite-width limit, DNN-SDFs admit a closed-form, kernel-based representation (the LFM-SDF) built from an extensive portfolio of non-linear characteristics and past market states. Depth and initialization embed inductive biases that interact with data availability to determine out-of-sample performance, with spectral regularization emerging from gradient descent dynamics. Empirically, deeper networks yield meaningful out-of-sample improvements when data are ample (e.g., longer rolling windows), supporting the depth-complexity narrative, while shallower models may suffice with limited data; nonetheless, in the kernel regime, the DNN-SDF behaves like an analyzable factor model with interpretable kernel-based features.

Abstract

We open up the black box behind Deep Learning for portfolio optimization and prove that a sufficiently wide and arbitrarily deep neural network (DNN) trained to maximize the Sharpe ratio of the Stochastic Discount Factor (SDF) is equivalent to a large factor model (LFM): A linear factor pricing model that uses many non-linear characteristics. The nature of these characteristics depends on the architecture of the DNN in an explicit, tractable fashion. This makes it possible to derive end-to-end trained DNN-based SDFs in closed form for the first time. We evaluate LFMs empirically and show how various architectural choices impact SDF performance. We document the virtue of depth complexity: With enough data, the out-of-sample performance of DNN-SDF is increasing in the NN depth, saturating at huge depths of around 100 hidden layers.

Paper Structure (14 sections, 11 theorems, 82 equations, 7 figures)


Key Result

Lemma 1

Let $F\in {\mathbb R}^{P\times T}$ denote a matrix of in-sample factor returns, and let ${\bf 1}=(1,\cdots,1)\in {\mathbb R}^T$ be the vector of ones. Then,
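
The statement of Lemma 1 is cut off in this excerpt, and the sketch below does not attempt to reconstruct it. It only illustrates, under stated assumptions, how the two objects the lemma introduces ($F$ and ${\bf 1}$) typically enter a ridge-penalized computation, using the standard push-through identity $(zI_P + FF')^{-1}F{\bf 1} = F(zI_T + F'F)^{-1}{\bf 1}$. The function names, the dimensions, and the value of the shrinkage parameter `z` are hypothetical, and any $1/T$ normalization is omitted.

```python
import numpy as np

def ridge_weights_primal(F, z):
    """Solve the P x P system (z*I_P + F F') w = F 1."""
    P, T = F.shape
    return np.linalg.solve(z * np.eye(P) + F @ F.T, F @ np.ones(T))

def ridge_weights_dual(F, z):
    """Same weights from a T x T system: w = F (z*I_T + F'F)^{-1} 1."""
    P, T = F.shape
    return F @ np.linalg.solve(z * np.eye(T) + F.T @ F, np.ones(T))

rng = np.random.default_rng(0)
F = rng.normal(size=(50, 12))  # hypothetical: P = 50 factors, T = 12 periods
z = 0.1                        # hypothetical shrinkage parameter

w_primal = ridge_weights_primal(F, z)
w_dual = ridge_weights_dual(F, z)
max_gap = float(np.max(np.abs(w_primal - w_dual)))  # numerical difference between the two forms
```

When $P$ is much larger than $T$, the second form only requires inverting a $T\times T$ matrix built from $F'F$, which is what makes a kernel-style representation computationally convenient.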

Figures (7)

  • Figure 1: The figure above shows $f(x; \theta; W)$, a mathematical representation of a neural network with a single hidden layer, also known as a shallow network. The weights $W \in {\mathbb R}^{d\times P}$ and $\theta \in {\mathbb R}^{P}$ are randomly initialized. $\phi: {\mathbb R} \rightarrow {\mathbb R}$ is an elementwise non-linear activation function, and the components of the vector $\phi(x'W)$ are known as random features.
  • Figure 2: RW=12, act=ReLU, $\alpha$=0.5. The figure above presents Sharpe ratios and t-statistics of alpha intercepts for the NTK and NNGP kernel portfolios from 1993/03 to 2022/11 as functions of the depth of the neural network underlying the kernels for various values of shrinkage parameters $z$. NTK stands for the NTK kernel portfolio, and NNGP stands for the NNGP kernel portfolio. Depth is the number of inner layers of the neural network. The t-statistics for alpha intercepts are derived from OLS regressions of the monthly returns of NTK (NNGP) kernel portfolios on Fama-French factors and the complexity factor. Fama-French factors include $R_m - R_f$, HML, SMB, Monthly 2-12 Momentum, Short-Term and Long-Term Reversal (as described in the Kenneth R. French data library).
  • Figure 3: RW=12, act=ReLU, $\alpha$=0.5. The figure above shows t-statistics of alpha intercepts for the NTK (NNGP) kernel portfolio with respect to the NNGP (NTK) kernel portfolio from 1993/03 to 2022/11 as functions of the depth of the neural network underlying the kernels for various values of shrinkage parameters $z$. NTK stands for the NTK kernel portfolios, and NNGP stands for the NNGP kernel portfolios. Depth is the number of inner layers of the neural network.
  • Figure 4: RW=60, act=ReLU, $\alpha$=0.5. The figure above presents Sharpe ratios and t-statistics of alpha intercepts for the NTK and NNGP kernel portfolios from 1993/03 to 2022/11 as functions of the depth of the neural network underlying the kernels for various values of shrinkage parameters $z$. NTK stands for the NTK kernel portfolio, and NNGP stands for the NNGP kernel portfolio. Depth is the number of inner layers of the neural network. The t-statistics for alpha intercepts are derived from OLS regressions of the monthly returns of NTK (NNGP) kernel portfolios on Fama-French factors and the complexity factor. Fama-French factors include $R_m - R_f$, HML, SMB, Monthly 2-12 Momentum, Short-Term and Long-Term Reversal (as described in the Kenneth R. French data library).
  • Figure 5: RW=60, act=ReLU, $\alpha$=0.5. The figure above shows t-statistics of alpha intercepts for the NTK (NNGP) kernel portfolio with respect to the NNGP (NTK) kernel portfolio from 1993/03 to 2022/11 as functions of the depth of the neural network underlying the kernels for various values of shrinkage parameters $z$. NTK stands for the NTK kernel portfolios, and NNGP stands for the NNGP kernel portfolios. Depth is the number of inner layers of the neural network.
  • ...and 2 more figures
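
The shallow network described in the caption of Figure 1 can be sketched as follows. The shapes of `W` and `theta` and the random-feature vector $\phi(x'W)$ come from the caption; the choice of ReLU for $\phi$, the Gaussian initialization, the linear read-out through `theta`, and the $1/\sqrt{P}$ output scaling are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def relu(u):
    """Elementwise activation phi; ReLU is an assumed choice."""
    return np.maximum(u, 0.0)

def shallow_network(x, W, theta, phi=relu):
    """One-hidden-layer network built from the random features phi(x'W)."""
    features = phi(x @ W)  # random features, shape (P,)
    # Assumed read-out: linear in theta with 1/sqrt(P) scaling.
    output = features @ theta / np.sqrt(W.shape[1])
    return output, features

rng = np.random.default_rng(0)
d, P = 5, 1000                 # hypothetical input dimension and width
W = rng.normal(size=(d, P))    # randomly initialized hidden-layer weights
theta = rng.normal(size=P)     # randomly initialized output weights
x = rng.normal(size=d)         # one hypothetical input vector

f_x, features = shallow_network(x, W, theta)
```

Holding `W` fixed and fitting only `theta` turns the network into a linear model in the `P` random features, which is the sense in which a wide network can be read as a model with many non-linear characteristics.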

Theorems & Definitions (16)

  • Definition 1: The Portfolio Kernel
  • Lemma 1
  • Theorem 1: LFM-SDF
  • Corollary 2
  • Definition 2: Multi-Layer Perceptron (MLP)
  • Definition 3: The Neural Network Gaussian Process (NNGP) Kernel
  • Definition 4: The Infinite Width NTK
  • Theorem 3: NTK is constant through training for wide NNs (Jacot et al., 2018)
  • Definition 5: The Portfolio Tangent Kernel (PTK)
  • Corollary 4: Infinite Width PTK
  • ...and 6 more
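
For readers unfamiliar with the NNGP and infinite-width NTK kernels named in Definitions 3 and 4, the following is a minimal sketch of the textbook recursion for a bias-free, fully connected ReLU network. It is not taken from the paper: the input normalization by $d$, the variance scaling that keeps the diagonal constant across layers, the absence of bias terms, and the function name `relu_nngp_ntk` are all assumptions, and the paper's own parameterization may differ.

```python
import numpy as np

def relu_nngp_ntk(X, depth):
    """NNGP and NTK Gram matrices for a bias-free ReLU MLP with `depth` hidden layers.

    X has shape (n, d). Uses the closed-form arc-cosine expectations for ReLU.
    """
    nngp = X @ X.T / X.shape[1]  # layer-0 covariance (assumed 1/d normalization)
    ntk = nngp.copy()
    for _ in range(depth):
        scale = np.sqrt(np.diag(nngp))
        norm = np.outer(scale, scale)
        cos = np.clip(nngp / norm, -1.0, 1.0)  # clip guards against rounding outside [-1, 1]
        angle = np.arccos(cos)
        # Expected product of activation derivatives under the previous layer's covariance.
        nngp_dot = (np.pi - angle) / np.pi
        # Expected product of activations (arc-cosine kernel of order 1).
        nngp = norm * (np.sin(angle) + (np.pi - angle) * cos) / np.pi
        # NTK recursion: new layer's covariance plus back-propagated previous NTK.
        ntk = nngp + ntk * nngp_dot
    return nngp, ntk

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))    # hypothetical inputs: 6 observations, 4 characteristics
K0 = X @ X.T / X.shape[1]
nngp3, ntk3 = relu_nngp_ntk(X, depth=3)
```

Under this scaling the NNGP diagonal is unchanged by depth, while the NTK diagonal grows linearly with the number of layers; depth changes the kernels mainly through how quickly off-diagonal similarities are reshaped.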