Pontryagin-Guided Policy Optimization for Merton's Portfolio Problem

Jeonggyu Huh, Jaegi Jeon

Abstract

We present a Pontryagin-Guided Direct Policy Optimization (PG-DPO) framework for Merton's portfolio problem, unifying modern neural-network-based policy parameterization with the adjoint viewpoint from Pontryagin's maximum principle (PMP). Instead of approximating the value function (as done in deep BSDE methods), we track a policy-fixed BSDE for the adjoint processes, which allows each gradient update to align with continuous-time PMP conditions. This setup yields locally optimal consumption and investment policies that are closely tied to classical stochastic control. We further incorporate an alignment penalty that nudges the learned policy toward Pontryagin-derived solutions, enhancing both convergence speed and training stability. Numerical experiments confirm that PG-DPO effectively handles both consumption and investment, achieving strong performance and interpretability without requiring large offline datasets or model-free reinforcement learning.

Paper Structure

This paper contains 31 sections, 2 theorems, 58 equations, 1 figure, 1 table, 2 algorithms.

Key Result

Theorem 1

Suppose (A1)--(A3) hold, and that $\|\delta_{k}\|\to0$ as $\Delta t\to0$. Then, with probability one, the parameter iterates converge to a limit $(\theta^{\dagger},\phi^{\dagger})$ at which $\nabla J(\theta^{\dagger},\phi^{\dagger})=0$. In other words, $(\theta^{\dagger},\phi^{\dagger})$ is a stationary point of $J$ in the parameter space.
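
To illustrate the kind of Robbins--Monro convergence the theorem refers to, the following is a minimal sketch on a toy problem, not the paper's objective $J$. The quadratic objective, the Gaussian noise model, the step-size schedule $a_k = 1/k$, and the names `robbins_monro` and `noisy_grad` are all assumptions made for this example. The schedule satisfies the standard conditions $\sum_k a_k = \infty$ and $\sum_k a_k^2 < \infty$.

```python
import random


def robbins_monro(grad_estimate, theta0, n_steps, seed=0):
    """Stochastic approximation with step sizes a_k = 1/k.

    grad_estimate(theta, rng) must return an unbiased, noisy estimate
    of the gradient of the objective at theta.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    theta = theta0
    for k in range(1, n_steps + 1):
        step = 1.0 / k  # sum of steps diverges, sum of squared steps converges
        theta -= step * grad_estimate(theta, rng)
    return theta


def noisy_grad(theta, rng):
    # Toy objective (assumption): J(theta) = 0.5 * (theta - 2)^2,
    # whose exact gradient is (theta - 2). We add zero-mean Gaussian noise.
    return (theta - 2.0) + rng.gauss(0.0, 1.0)


theta_final = robbins_monro(noisy_grad, theta0=10.0, n_steps=20000)
print(theta_final)  # close to the stationary point theta = 2
```

In this toy case the only stationary point is $\theta = 2$, and the iterate after $n$ steps differs from it by the running average of the noise, which shrinks at rate $1/\sqrt{n}$. The theorem makes an analogous claim for the policy parameters $(\theta,\phi)$ under its own assumptions (A1)--(A3).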

Figures (1)

  • Figure 1: Neural vs. exact solutions for consumption/investment under the Merton model. The learned plots (left) are from our PG-DPO-Align run at iteration 100,000; the exact plots (right) show the closed-form Merton solution.
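
As a point of reference for the "exact" investment panel, here is a minimal sketch of the classical Merton fraction, assuming CRRA utility with relative risk aversion $\gamma$ and a single risky asset with constant drift $\mu$ and volatility $\sigma$ alongside a risk-free rate $r$. Under those assumptions the optimal risky-asset weight is $\pi^{*} = (\mu - r)/(\gamma\sigma^{2})$. The parameter values below are illustrative and are not taken from the paper's experiments, and the function name `merton_fraction` is hypothetical.

```python
def merton_fraction(mu, r, sigma, gamma):
    """Classical Merton risky-asset weight under CRRA utility (assumed setting).

    mu:    drift of the risky asset
    r:     risk-free rate
    sigma: volatility of the risky asset (must be positive)
    gamma: relative risk aversion (must be positive)
    """
    if sigma <= 0 or gamma <= 0:
        raise ValueError("sigma and gamma must be positive")
    return (mu - r) / (gamma * sigma ** 2)


# Illustrative parameters (assumptions, not the paper's configuration).
pi_star = merton_fraction(mu=0.08, r=0.02, sigma=0.2, gamma=2.0)
print(pi_star)  # approximately 0.75
```

The weight is constant in wealth and time in this setting, which is why a learned investment policy can be compared directly against it.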

Theorems & Definitions (4)

  • Theorem 1: Baseline Robbins--Monro Convergence
  • Proof sketch of Theorem 1
  • Theorem 2: Stationarity of $\widetilde{J}$
  • Proof sketch of Theorem 2