An Efficient On-Policy Deep Learning Framework for Stochastic Optimal Control

Mengjian Hua; Mathieu Laurière; Eric Vanden-Eijnden

TL;DR

This work tackles the scalability bottleneck of differentiating through controlled SDEs in stochastic optimal control (SOC) by using the Girsanov change of measure to obtain an on-policy gradient that needs no backpropagation through the simulated trajectories. It derives a gradient formula that is evaluated on paths generated by the current control and offers a practical alternative objective amenable to automatic differentiation. The framework is applied to (i) constructing Föllmer processes for sampling from unnormalized distributions and (ii) fine-tuning diffusion-model drifts to tilt target distributions, achieving substantial reductions in memory and computation compared to SDE-differentiation baselines. Overall, the method provides a scalable, memory-efficient way to train neural controllers for SOC and for related sampling and generative-modeling tasks.

Abstract

We present a novel on-policy algorithm for solving stochastic optimal control (SOC) problems. By leveraging the Girsanov theorem, our method directly computes on-policy gradients of the SOC objective without expensive backpropagation through stochastic differential equations or adjoint problem solutions. This approach significantly accelerates the optimization of neural network control policies while scaling efficiently to high-dimensional problems and long time horizons. We evaluate our method on classical SOC benchmarks as well as applications to sampling from unnormalized distributions via Schrödinger-Föllmer processes and fine-tuning pre-trained diffusion models. Experimental results demonstrate substantial improvements in both computational speed and memory efficiency compared to existing approaches.
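To make the idea of an on-policy gradient without backpropagation through the SDE concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption chosen for illustration, not the paper's setup: a one-dimensional SDE $dX_t = \theta\,dt + dW_t$ with a constant control $u_\theta = \theta$, running cost $\tfrac12|u|^2$, terminal cost $g(x) = x^2$, and the hypothetical function name `on_policy_gradient`. The estimator is the standard likelihood-ratio form that follows from a Girsanov reweighting evaluated at the current control; it may differ in detail from the formula derived in the paper.

```python
import math
import random

def on_policy_gradient(theta, g, T=1.0, n_steps=10, n_paths=100_000, seed=0):
    """Likelihood-ratio estimate of dJ/dtheta for the constant control u_theta(x) = theta.

    Toy problem (an assumption, not one of the paper's benchmarks):
        dX_t = theta dt + dW_t,  X_0 = 0,
        J(theta) = E[ int_0^T 0.5 * theta^2 dt + g(X_T) ].
    Paths are simulated with the current control and never differentiated through.
    """
    rng = random.Random(seed)
    dt = T / n_steps
    sqrt_dt = math.sqrt(dt)
    du_dtheta = 1.0  # derivative of the control with respect to its parameter
    total = 0.0
    for _ in range(n_paths):
        x = 0.0
        running_cost = 0.0  # int 0.5 * u^2 dt
        cost_grad = 0.0     # int u * (du/dtheta) dt
        score = 0.0         # int (du/dtheta) dW
        for _ in range(n_steps):
            dw = rng.gauss(0.0, sqrt_dt)
            running_cost += 0.5 * theta * theta * dt
            cost_grad += theta * du_dtheta * dt
            score += du_dtheta * dw
            x += theta * dt + dw
        # Explicit derivative of the cost plus (total cost) x (score of the path).
        total += cost_grad + (running_cost + g(x)) * score
    return total / n_paths

theta, T = 0.5, 1.0
estimate = on_policy_gradient(theta, g=lambda x: x * x, T=T)
# Closed form for this toy problem: J = theta^2 T / 2 + theta^2 T^2 + T.
exact = theta * T + 2.0 * theta * T ** 2
print(f"estimate = {estimate:.3f}, exact = {exact:.3f}")
```

The only quantities stored per path are three running sums, so memory does not grow with the number of time steps. The Monte Carlo estimate is noisy; in practice a variance-reduction baseline would likely be needed, which this sketch omits.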

Paper Structure (20 sections, 7 theorems, 54 equations, 4 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Methods
Problem setup
Reformulation with Girsanov theorem
Gradient computation
Alternative objective for implementation
Application to sampling via construction of a Föllmer process
Application to fine-tuning
Experiments
Sampling from an unnormalized distribution
Fine-tuning on the $\varphi^4$ model
Numerical Results:
Conclusion
Proofs of Propositions \ref{th:grad}, \ref{th:fine:tune} and \ref{th:fine:tune:2}
...and 5 more sections

Key Result

Lemma 1

Given a reference control $v \in {\mathcal{U}}$, the objective in (eq:obj) can be expressed as an expectation, (eq:obj:G), of the cost accumulated along $X^v$ and weighted by $M(u,v)$, where $X^v=(X^v_t)_{t\in[0,T]}$ solves the SDE (eq:sde) with $u$ replaced by $v$, $M(u,v)$ is the Girsanov factor, and the expectation $\mathbb{E}$ in (eq:obj:G) is taken over the law of $X^v$.
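The reweighting in Lemma 1 can be checked numerically on a small example. The sketch below is an illustration under stated assumptions rather than the paper's code: it takes a one-dimensional SDE $dX_t = u(X_t)\,dt + dW_t$ with unit diffusion, uses the Girsanov factor for that case, $M(u,v) = \exp\big(\int (u-v)\,dW - \tfrac12\int |u-v|^2\,dt\big)$, and considers only a terminal cost $g(X_T)$. The function names `reweighted_estimate` and `euler_ou_second_moment` are hypothetical.

```python
import math
import random

def reweighted_estimate(u, v, g, T=1.0, n_steps=50, n_paths=20_000, seed=0):
    """Estimate E[g(X^u_T)] using only paths simulated under the reference control v."""
    rng = random.Random(seed)
    dt = T / n_steps
    sqrt_dt = math.sqrt(dt)
    total = 0.0
    for _ in range(n_paths):
        x, log_m = 0.0, 0.0
        for _ in range(n_steps):
            dw = rng.gauss(0.0, sqrt_dt)
            a = u(x) - v(x)
            # Accumulate log M(u, v) along the reference path (unit diffusion assumed).
            log_m += a * dw - 0.5 * a * a * dt
            # Euler-Maruyama step under the reference control v.
            x += v(x) * dt + dw
        total += g(x) * math.exp(log_m)
    return total / n_paths

def euler_ou_second_moment(T=1.0, n_steps=50):
    """Exact E[X_T^2] for the Euler chain X_{k+1} = (1 - dt) X_k + dW_k started at 0."""
    dt = T / n_steps
    var = 0.0
    for _ in range(n_steps):
        var = (1.0 - dt) ** 2 * var + dt
    return var

# Target control u(x) = -x (Ornstein-Uhlenbeck drift); reference control v = 0.
estimate = reweighted_estimate(u=lambda x: -x, v=lambda x: 0.0, g=lambda x: x * x)
reference = euler_ou_second_moment()
print(f"reweighted = {estimate:.3f}, reference = {reference:.3f}")
```

For Euler-Maruyama chains the discrete weight is the exact likelihood ratio between the two chains, so the reweighted estimate targets the second moment of the discretized process under $u$, and the two numbers should agree up to Monte Carlo error. Setting $v = u$ makes the weight identically one, which is the on-policy case.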

Figures (4)

Figure 1: Funnel distribution example ($\sigma = 1$): The samples from the funnel distribution (left panel), Path Integral Samplers (middle panel) and our method (right panel). To plot the samples of the ten-dimensional funnel distribution in 2D, we use the independence of its coordinates $\{x_1,\cdots, x_9\}$, squeeze these nine dimensions into one coordinate, and keep the first dimension $x_0$.
Figure 2: Top panel: Histograms of the average magnetization of $10000$ lattice configurations sampled with the trained control of our method, the vanilla method, Langevin dynamics with $E(\varphi)$, and the reference PDF $\rho_0(\varphi)$. We use samples obtained by running Langevin dynamics with the target potential $E(\varphi)$ as the ground truth (see Appendix \ref{app:phi_4} for details on sampling with Langevin dynamics). The results of our method are closer to the Langevin target than those of the vanilla method, which adopts a much smaller model and fails to capture the statistics. Bottom panel: The empirical loss as a function of GPU compute time. Our method allows a much larger model under the same memory budget and therefore shows a much better learning curve than the vanilla method.
Figure 3: Linear Ornstein-Uhlenbeck Example: Our method outperforms the vanilla method in terms of convergence rate measured by the squared $L^2$ error (top panel) and the training loss (bottom panel).
Figure 4: Quadratic Ornstein-Uhlenbeck Example: $L^2_2$ error in terms of the number of training iterations (left panel) and the GPU compute time (right panel).

Theorems & Definitions (10)

Lemma 1
Proposition 1
Proposition 2
Proposition 3
Proposition 4
Proposition 5
Proposition 6
Proof of Proposition \ref{th:grad}
Proof of Proposition \ref{th:fine:tune}
Proof of Proposition \ref{th:fine:tune:2}
