Offline Supervised Learning V.S. Online Direct Policy Optimization: A Comparative Study and A Unified Training Paradigm for Neural Network-Based Optimal Feedback Control

Yue Zhao; Jiequn Han

TL;DR

This work benchmarks two neural-network-based strategies for optimal feedback control: offline supervised learning (SL), which learns from precomputed open-loop solutions, and online direct policy optimization (DO), which optimizes a closed-loop policy directly on the original optimal control problem (OCP). It shows that SL generally achieves near-optimal performance with far lower training time, while DO benefits from good initialization but is sensitive to the optimization landscape, especially over longer horizons. To combine the strengths of both, the authors propose a unified Pre-train and Fine-tune paradigm: pre-train with SL on an open-loop dataset, then fine-tune online with DO, which improves performance and robustness on challenging tasks. Experiments on satellite attitude control and quadrotor landing show that SL outperforms DO in many settings and that fine-tuning consistently enhances robustness while reducing the data and time needed to approach optimal control. The work offers practical guidance for training neural network-based optimal feedback controllers and releases code to support replication and further research.

Abstract

This work is concerned with efficiently training neural network-based feedback controllers for optimal control problems. We first conduct a comparative study of two prevalent approaches: offline supervised learning and online direct policy optimization. Although the training part of the supervised learning approach is relatively easy, the success of the method depends heavily on the optimal control dataset generated by open-loop optimal control solvers. In contrast, direct policy optimization turns the optimal control problem into an optimization problem directly, without any pre-computation, but the dynamics-related objective can be hard to optimize when the problem is complicated. Our results underscore the superiority of offline supervised learning in terms of both optimality and training time. To overcome the main challenge of each approach, the dataset for supervised learning and the optimization for direct policy optimization, we combine the two and propose the Pre-train and Fine-tune strategy as a unified training paradigm for optimal feedback control, which further improves performance and robustness significantly. Our code is accessible at https://github.com/yzhao98/DeepOptimalControl.
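
As a rough illustration of the Pre-train and Fine-tune idea, the following sketch applies it to a toy scalar problem. Everything in it is an assumption made for illustration and is not taken from the paper or its repository: the dynamics $x_{t+1} = x_t + u_t$, the quadratic cost, the one-parameter policy $u = -kx$, the terminal weight (chosen so that the optimal gain is constant over the horizon), the noisy labels standing in for an imperfect open-loop dataset, and the finite-difference gradient used in place of differentiating through the dynamics.

```python
import random

PHI = (1 + 5 ** 0.5) / 2      # terminal weight; makes the optimal gain time-invariant
K_STAR = 1 / PHI              # optimal feedback gain of this toy problem
N = 10                        # horizon length (hypothetical)
X0S = [-1.0, -0.5, 0.5, 1.0]  # initial states used for training and evaluation


def closed_loop_cost(k, x0s=X0S):
    """Average cost of the policy u = -k * x rolled out in closed loop (DO objective)."""
    total = 0.0
    for x in x0s:
        for _ in range(N):
            u = -k * x
            total += x * x + u * u   # running cost
            x = x + u                # dynamics x_{t+1} = x_t + u_t
        total += PHI * x * x         # terminal cost
    return total / len(x0s)


def make_dataset(noise=0.05, seed=0):
    """Stand-in for an open-loop solver: optimal (state, control) pairs with label noise."""
    rng = random.Random(seed)
    data = []
    for x in X0S:
        for _ in range(N):
            u_opt = -K_STAR * x
            data.append((x, u_opt + rng.gauss(0.0, noise)))
            x = x + u_opt
    return data


def supervised_fit(data):
    """Stage I: least-squares fit of k in u = -k * x on the dataset (SL objective)."""
    sxu = sum(x * u for x, u in data)
    sxx = sum(x * x for x, _ in data)
    return -sxu / sxx


def fine_tune(k, steps=200, lr=0.05, eps=1e-6):
    """Stage II: gradient descent on the closed-loop cost, using central differences."""
    for _ in range(steps):
        grad = (closed_loop_cost(k + eps) - closed_loop_cost(k - eps)) / (2 * eps)
        k -= lr * grad
    return k


k_sl = supervised_fit(make_dataset())   # pre-train on the (noisy) open-loop data
k_ft = fine_tune(k_sl)                  # fine-tune on the original objective
print(f"SL gain         {k_sl:.4f}, cost {closed_loop_cost(k_sl):.6f}")
print(f"fine-tuned gain {k_ft:.4f}, cost {closed_loop_cost(k_ft):.6f}")
```

In this toy setting the supervised fit lands close to the optimal gain, and the fine-tuning stage removes the residual error left by the label noise. The example only shows how the two objectives differ and how one stage can initialize the other; it says nothing about the relative training cost or landscape difficulty reported in the paper.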

Paper Structure (31 sections, 16 equations, 9 figures, 9 tables)

Introduction
Preliminaries and Related Works
Mathematical Formulation
Offline Supervised Learning
Data Generation
Objective
Online Direct Policy Optimization
Objective
Comparisons and A Unified Framework
Comparative Analysis
A Unified Training Paradigm
Experiments
Experimental Settings
The Optimal Attitude Control Problem of Satellite
Settings
...and 16 more sections

Figures (9)

Figure 1: A unified training paradigm for neural network-based closed-loop optimal control consists of two stages. In Stage I, we first solve the corresponding open-loop OCPs to generate a dataset, on which we train the controller through supervised learning. In Stage II, we fine-tune the controller pre-trained in Stage I through online direct policy optimization.
Figure 2: Cumulative distribution function of the cost ratio in the satellite problem with uniform disturbances of $\sigma=0.01, 0.025, 0.05$. The spans of the horizontal axis are different.
Figure 3: Comparative analysis of the optimization landscape in direct policy optimization and supervised learning. The figure displays, from left to right, normalized variations in different losses, normalized variations of $l_2$ changes in the gradient, and the normalized effective $\beta$-smoothness (the maximum ratio between gradient difference (in $l_2$-norm) and parameter difference) as moving in the gradient direction with different $step\_size$.
Figure 4: Cumulative distribution function of the cost ratio in quadrotor's optimal landing problem on $\bm{x}_0 \in \tilde{\mathcal{S}}_{\text{quad}}$ with varying time horizons $T=4, 8, 16$. Note that the spans of the horizontal axis increase from left to right.
Figure 5: Cumulative distribution function of the cost ratio in quadrotor's optimal landing problem on $\bm{x}_0 \in \mathcal{S}_{\text{quad}}$ with $T=16$. Left and middle: closed-loop evaluation in deterministic environments where supervised learning is trained on the dataset with uniformly sampled initial states and the adaptive dataset generated by IVP-enhanced sampling respectively. Right: closed-loop evaluation in stochastic environments with $\sigma=0.25$ where supervised learning is trained on the adaptive dataset. The spans of the horizontal axis are different.
...and 4 more figures
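
The effective $\beta$-smoothness described in the Figure 3 caption can be sketched as follows. This is one reading of the caption, not the authors' implementation: the function name `effective_beta`, the choice to move along the negative gradient, the toy quadratic loss, and the omission of the normalization mentioned in the caption are all assumptions.

```python
import math


def effective_beta(grad_fn, theta, step_sizes):
    """Max over step sizes of ||g(theta) - g(theta - s*g(theta))||_2 / ||s*g(theta)||_2."""
    g = grad_fn(theta)
    best = 0.0
    for s in step_sizes:
        moved = [t - s * gi for t, gi in zip(theta, g)]  # step along the negative gradient
        g_moved = grad_fn(moved)
        grad_diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(g, g_moved)))
        param_diff = math.sqrt(sum((s * gi) ** 2 for gi in g))
        best = max(best, grad_diff / param_diff)
    return best


def quad_grad(theta, curvatures=(1.0, 4.0)):
    """Gradient of the toy loss 0.5 * sum(h_i * theta_i ** 2); hypothetical example loss."""
    return [h * t for h, t in zip(curvatures, theta)]


beta = effective_beta(quad_grad, [1.0, 1.0], [0.01, 0.1, 0.5])
print(beta)
```

For a quadratic loss the ratio does not depend on the step size and cannot exceed the largest curvature (4 here). For a neural network loss it varies with the step size, which is what the rightmost panel of Figure 3 reports.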
