Convergence of Neural Network Policies for Risk--Reward Optimization

Chang Chen, Duy-Minh Dang

TL;DR

Under mild regularity conditions, the empirical optimum of the NN-parametrized objective is proved to converge in probability to the true optimal value as network capacity and training sample size increase.

Abstract

We develop a neural-network framework for multi-period risk--reward stochastic control problems with constrained two-step feedback policies that may be discontinuous in the state. We allow a broad class of objectives built on a finite-dimensional performance vector, including terminal and path-dependent statistics, with risk functionals admitting auxiliary-variable optimization representations (e.g., Conditional Value-at-Risk and buffered probability of exceedance) and optional moment dependence. Our approach parametrizes the two-step policy using two coupled feedforward networks with constraint-enforcing output layers, reducing the constrained control problem to unconstrained training over network parameters. Under mild regularity conditions, we prove that the empirical optimum of the NN-parametrized objective converges in probability to the true optimal value as network capacity and training sample size increase. The proof is modular, separating policy approximation, propagation through the controlled recursion, and preservation under the scalarized risk--reward objective. Numerical experiments confirm the predicted convergence-in-probability behavior, show close agreement between learned and reference control heat maps, and demonstrate out-of-sample robustness on a large independent scenario set.
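
The auxiliary-variable representation mentioned in the abstract can be illustrated for Conditional Value-at-Risk on a finite sample. The sketch below assumes the loss-based Rockafellar-Uryasev form, $\mathrm{CVaR}_\alpha(L)=\min_c\{c+\mathbb{E}[(L-c)^+]/(1-\alpha)\}$; the paper may use a different sign or tail convention, and the function names are hypothetical. It checks the minimized form against a direct average of the worst outcomes.

```python
def cvar_auxiliary(losses, alpha):
    """Sample CVaR via the auxiliary-variable form (assumed loss convention).

    Minimizes c + mean((L - c)^+) / (1 - alpha) over the auxiliary variable c.
    The objective is convex and piecewise linear in c with kinks at the
    sample values, so a minimizer can be found among the samples themselves.
    """
    n = len(losses)

    def objective(c):
        excess = sum(max(x - c, 0.0) for x in losses)
        return c + excess / ((1.0 - alpha) * n)

    return min(objective(c) for c in losses)


def cvar_tail_average(losses, alpha):
    """Direct sample CVaR: mean of the worst (1 - alpha) fraction of losses.

    Matches the auxiliary form when (1 - alpha) * n is an integer.
    """
    k = round((1.0 - alpha) * len(losses))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k


# Illustrative data only: ten equally likely losses.
sample_losses = [float(i) for i in range(1, 11)]
via_auxiliary = cvar_auxiliary(sample_losses, alpha=0.8)
via_tail = cvar_tail_average(sample_losses, alpha=0.8)
```

In a training loop the auxiliary variable would typically be optimized jointly with the network parameters rather than searched over sample points; the exhaustive search here is only for checking the identity on a small sample.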

Paper Structure

This paper contains 52 sections, 13 theorems, 119 equations, 3 figures, and 8 tables.

Key Result

Theorem 3.1

Let $X$ be an $\mathbb{R}^{\nu_0}$-valued random variable and let $f:\mathbb{R}^{\nu_0}\to\mathbb{R}^{d}$ be Borel measurable. Then there exists a sequence $\{F_{n}\}_{n\in\mathbb{N}}$, where $F_{n}=F(\cdot;\theta_{n})\in \mathcal{Q}_{n}$, such that for all $\varepsilon>0$,

Figures (3)

  • Figure 5.1: Empirical optima $\widehat{V}^{(j)}_{n,K}$ across $N_{\mathrm{run}}=100$ runs. Boxes show the interquartile range (25%--75%) with median line; whiskers extend to $1.5\times\mathrm{IQR}$ and points beyond are plotted as outliers. The dashed line indicates the reference value $V_{\mathrm{ref}}=1605.22$.
  • Figure 5.2: Policy heatmap comparison.
  • Figure 5.3: Withdrawal slice at $t=15$ years.
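
The box-plot convention described in the caption of Figure 5.1 can be reproduced from a list of run-level empirical optima. The following is a minimal sketch: the function name and the input data are hypothetical, and the quartile interpolation rule (`method="inclusive"`) is an assumption, since plotting libraries differ on this detail.

```python
import statistics


def box_plot_summary(values, whisker=1.5):
    """Quartiles, whisker ends, and outliers for a Tukey-style box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Illustrative values only, not results from the paper.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

Applied per configuration to the values $\widehat{V}^{(j)}_{n,K}$, $j=1,\dots,N_{\mathrm{run}}$, a narrowing box around the reference value as the training sample size grows is what convergence in probability would look like in such a plot.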

Theorems & Definitions (23)

  • Remark 2.1: Well-posedness of the controlled recursion
  • Remark 2.2: Pre-commitment vs. time-consistent formulations
  • Definition 3.1: Feedforward neural network
  • Theorem 3.1: Universal approximation for a random input (Hornik et al., 1989)
  • Lemma 3.2: Composition with (a.s.-continuous) activations
  • Lemma 3.3: Boundary approximation via open-range activations
  • Theorem 3.4
  • Theorem 3.5
  • Remark 3.6: Training algorithm
  • Lemma 4.1
  • ...and 13 more
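
The abstract describes constraint-enforcing output layers that turn the constrained control problem into unconstrained training. A minimal sketch of that idea follows, assuming one scalar control restricted to an interval and one allocation vector restricted to the probability simplex. The function names, the bounds, and the choice of sigmoid and softmax are illustrative assumptions, not the paper's stated architecture.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function; its mathematical range is (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bounded_output(z, lower, upper):
    """Map an unconstrained pre-activation z into the interval (lower, upper)."""
    return lower + (upper - lower) * sigmoid(z)


def simplex_output(zs):
    """Softmax: map unconstrained pre-activations to weights that sum to 1."""
    shift = max(zs)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical bounds and pre-activations, for illustration only.
control = bounded_output(0.0, lower=35.0, upper=60.0)
weights = simplex_output([2.0, -1.0, 0.5])
```

Because the sigmoid and softmax have open ranges, boundary values of the constraint set are only approached as the pre-activations grow, never attained exactly. This appears to be the situation that the title of Lemma 3.3 refers to.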