Supervised Learning for Stochastic Optimal Control

Vince Kurtz; Joel W. Burdick

TL;DR

This paper addresses the data-scarcity challenge in learning controllers for continuous-time stochastic systems by recasting stochastic optimal control as a supervised regression problem. It linearizes the nonlinear Hamilton-Jacobi-Bellman (HJB) equation through the desirability transformation $V = -\lambda \log \Psi$ and uses the Feynman-Kac representation to generate training data offline via Monte Carlo simulations of a diffusion process. A neural-network approximation $\Psi_\theta$ is learned from these samples, and the resulting policy $\pi_\theta(x) = \lambda R^{-1} G(x)^T \frac{\nabla_x \Psi_\theta(x)}{\Psi_\theta(x)}$ can be applied in real time. The approach is validated on a double integrator and a nonlinear pendulum, illustrating data-efficient, demonstration-free learning of stochastic optimal control policies with potential for GPU-accelerated scaling.
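As an illustration of the policy formula above, the following minimal sketch evaluates $\lambda R^{-1} G(x)^T \nabla_x \Psi(x) / \Psi(x)$ for a scalar control input. Everything named here is an assumption for illustration: `psi_quadratic` is a hand-written stand-in for a trained network $\Psi_\theta$, `G_double_integrator` assumes a double-integrator input matrix $G = [0, 1]^T$, and `grad_fd` uses central finite differences in place of the automatic differentiation a neural-network implementation would normally use. Because $V = -\lambda \log \Psi$ implies $\nabla_x V = -\lambda \nabla_x \Psi / \Psi$, the formula is equivalent to $-R^{-1} G(x)^T \nabla_x V(x)$, which the example uses as a sanity check.

```python
import math

def grad_fd(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x (list of floats)."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * eps))
    return grad

def policy(psi, x, lam, r_inv, G):
    """Scalar-input policy: lam * R^{-1} * G(x)^T grad(psi)(x) / psi(x)."""
    grad = grad_fd(psi, x)
    g_dot_grad = sum(g_i * d_i for g_i, d_i in zip(G(x), grad))
    return lam * r_inv * g_dot_grad / psi(x)

LAM = 0.5  # hypothetical value of lambda, chosen only for this example

def psi_quadratic(x):
    """Hypothetical desirability: Psi = exp(-V / lambda) with V = q^2 + q*v + v^2."""
    q, v = x
    return math.exp(-(q * q + q * v + v * v) / LAM)

def G_double_integrator(x):
    """Assumed input matrix for a double integrator: the control acts on velocity only."""
    return [0.0, 1.0]

x = [1.0, 0.5]
u = policy(psi_quadratic, x, LAM, r_inv=1.0, G=G_double_integrator)

# Equivalent form -R^{-1} G^T grad(V), with dV/dv = q + 2v for this V.
u_from_value = -(x[0] + 2.0 * x[1])
```

The two values agree up to finite-difference error, which shows that the policy only needs $\Psi_\theta$ and its gradient at the current state.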

Abstract

Supervised machine learning is powerful. In recent years, it has enabled massive breakthroughs in computer vision and natural language processing. But leveraging these advances for optimal control has proved difficult. Data is a key limiting factor. Without access to the optimal policy, value function, or demonstrations, how can we fit a policy? In this paper, we show how to automatically generate supervised learning data for a class of continuous-time nonlinear stochastic optimal control problems. In particular, applying the Feynman-Kac theorem to a linear reparameterization of the Hamilton-Jacobi-Bellman PDE allows us to sample the value function by simulating a stochastic process. Hardware accelerators like GPUs could rapidly generate a large amount of this training data. With this data in hand, stochastic optimal control becomes supervised learning.
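To make the idea of sampling a PDE solution by simulating a stochastic process concrete, here is a minimal sketch of a generic Feynman-Kac Monte Carlo estimator for a scalar state. It is not the paper's algorithm: the drift `mu`, diffusion `sigma`, discount rate `c`, and boundary function `g` are placeholder names, the Euler-Maruyama discretization and step count are assumptions, and the toy problem at the bottom is chosen only because its solution, $u(x, t) = x^2 + t$, is known in closed form.

```python
import math
import random

def feynman_kac_sample(x0, t_final, mu, sigma, c, g, rng, n_steps=20):
    """One noisy sample of u(x0, t_final) from a single Euler-Maruyama path."""
    dt = t_final / n_steps
    x = x0
    integral_c = 0.0  # running approximation of the integral of c(X_s) ds
    for _ in range(n_steps):
        integral_c += c(x) * dt
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over one step
        x = x + mu(x) * dt + sigma(x) * dw
    return math.exp(-integral_c) * g(x)

def feynman_kac_estimate(x0, t_final, mu, sigma, c, g, n_paths=20000, seed=0):
    """Average many single-path samples to approximate the expectation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        total += feynman_kac_sample(x0, t_final, mu, sigma, c, g, rng)
    return total / n_paths

# Toy check: pure Brownian motion, no discounting, g(x) = x^2,
# for which E[X_t^2 | X_0 = x] = x^2 + t.
estimate = feynman_kac_estimate(
    x0=1.0, t_final=1.0,
    mu=lambda x: 0.0, sigma=lambda x: 1.0,
    c=lambda x: 0.0, g=lambda x: x * x,
)
exact = 1.0 ** 2 + 1.0
```

Each call to `feynman_kac_sample` returns one noisy value whose mean is the quantity of interest. In a regression setting such per-path values could serve directly as training targets, and because the paths are independent, the loop is the kind of workload that maps well onto GPUs.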

Paper Structure

This paper contains 9 sections, 3 theorems, 32 equations, 4 figures, and 1 algorithm.

Introduction
Related Work
Problem Statement
Generating Training Data
A Supervised Learning Algorithm
Examples
Double Integrator
Pendulum
Conclusion and Future Work

Key Result

Theorem 1

Let $X_t$ be the solution to the stochastic differential equation, where $x \in \mathbb{R}^n$, $s \in [0, T]$, and $W$ is standard Brownian noise. Then the viscosity solution of (eq:fk_pde), (eq:fk_boundary) is given by the expectation (eq:fk_expectation). Furthermore, if Eqs. (eq:fk_pde) and (eq:fk_boundary) admit a classical solution, then (eq:fk_expectation) provides that classical solution.
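The equations referenced by this theorem are not reproduced in this summary. For orientation only, the textbook form of the Feynman-Kac formula for time-homogeneous coefficients is sketched below. The symbols $\mu$, $\sigma$, $c$, $g$, and $u$ are generic placeholders and are not taken from the paper, whose own notation and assumptions may differ.

```latex
% Generic Feynman-Kac statement; \mu, \sigma, c, g, u are placeholder symbols.
\begin{align}
  dX_s &= \mu(X_s)\,ds + \sigma(X_s)\,dW_s, \qquad X_0 = x, \\
  \frac{\partial u}{\partial t}
    &= \mu(x)^\top \nabla_x u
     + \tfrac{1}{2}\operatorname{tr}\!\big(\sigma(x)\sigma(x)^\top \nabla_x^2 u\big)
     - c(x)\,u, \\
  u(x, 0) &= g(x), \\
  u(x, t) &= \mathbb{E}\!\left[\exp\!\Big(-\int_0^t c(X_s)\,ds\Big)\, g(X_t)
             \,\Big|\, X_0 = x\right].
\end{align}
```

The first line is the diffusion, the second and third are the linear PDE and its boundary condition, and the last is the expectation that represents the solution.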

Figures (4)

Figure 1: Training data (top) and neural network fit (bottom) for an inverted pendulum swing-up task. Light yellow indicates a high desirability score, a transformation of the value function. Training data is generated using only simulations with random inputs, but exhibits the characteristic "swirl" of the value function for this nonlinear system.
Figure 2: The true desirability function (yellow) and samples from the stochastic process (eq:desirability_sde) (black) for a double integrator. For systems where a closed-form solution is not available, these samples can serve as training data.
Figure 3: Closed-loop vector field (black arrows) and simulated rollout (red) under the learned policy $\pi_\theta(x)$ for an inverted pendulum. This policy is generated in seconds using supervised learning, without access to demonstrations.
Figure 4: Snapshots of the noisy desirability (log value function) targets $\hat{\Psi}^i$ generated by the diffusion process (eq:desirability_sde). These samples start out matching the terminal cost $\phi(x)$ at $T=0$ seconds. By $T=0.8$ seconds, the characteristic swirl of the value function is clearly visible.

Theorems & Definitions (7)

Remark 1
Theorem 1: Feynman-Kac
Theorem 2
proof
Theorem 3
proof
Remark 2
