A Pontryagin Perspective on Reinforcement Learning
Onno Eberhard, Claire Vernade, Michael Muehlebach
TL;DR
This paper introduces open-loop reinforcement learning (OLRL), where the controller optimizes a fixed action sequence $u_{0:T-1}$ rather than a feedback policy, and provides convergence guarantees for three algorithms that operate with incomplete dynamics information. The authors derive a Pontryagin-based gradient framework, replacing Bellman-based DP, and present a model-based method with robustness guarantees, plus two model-free variants (on-trajectory and off-trajectory) that estimate the necessary Jacobians along trajectories via rollouts and online learning. Empirical results on an inverted pendulum and MuJoCo tasks demonstrate robust performance under model misspecification, high sample efficiency for the on-trajectory method, and strong convergence with the off-trajectory approach, sometimes rivaling closed-loop SAC baselines without function approximation. These findings highlight a viable alternative trajectory-optimization paradigm for RL in predictable dynamics, provide insights into the tradeoffs between model-based/model-free and on-/off-trajectory strategies, and suggest directions for integrating feedforward open-loop control with feedback mechanisms.
Abstract
Reinforcement learning has traditionally focused on learning state-dependent policies to solve optimal control problems in a closed-loop fashion. In this work, we introduce the paradigm of open-loop reinforcement learning where a fixed action sequence is learned instead. We present three new algorithms: one robust model-based method and two sample-efficient model-free methods. Rather than basing our algorithms on Bellman's equation from dynamic programming, our work builds on Pontryagin's principle from the theory of open-loop optimal control. We provide convergence guarantees and evaluate all methods empirically on a pendulum swing-up task, as well as on two high-dimensional MuJoCo tasks, significantly outperforming existing baselines.
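To make the idea of a Pontryagin-based gradient concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a hypothetical linear system $x_{t+1} = A x_t + B u_t$ with known Jacobians $A$ and $B$, a control penalty $\tfrac{1}{2} R \lVert u_t \rVert^2$, and a terminal cost $\tfrac{1}{2} \lVert x_T - x_{\text{goal}} \rVert^2$; all names and constants here are illustrative. The costate is initialized as $\lambda_T = x_T - x_{\text{goal}}$ and propagated backward as $\lambda_t = A^\top \lambda_{t+1}$, giving the gradient $\nabla_{u_t} J = R u_t + B^\top \lambda_{t+1}$, which is then used for plain gradient descent on the action sequence. The model-free variants described above would replace the known $A$ and $B$ with Jacobians estimated from rollouts, which this sketch does not attempt.

```python
import numpy as np

# Hypothetical double-integrator dynamics x_{t+1} = A x_t + B u_t (assumed, not from the paper).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
R = 0.01                      # weight of the control penalty (assumed)
GOAL = np.array([1.0, 0.0])   # target terminal state (assumed)


def rollout(x0, us):
    """Simulate the open-loop action sequence us and return states x_0..x_T."""
    xs = [x0]
    for u in us:
        xs.append(A @ xs[-1] + B @ u)
    return np.array(xs)


def cost(x0, us):
    """J = sum_t 0.5*R*|u_t|^2 + 0.5*|x_T - GOAL|^2."""
    xs = rollout(x0, us)
    return 0.5 * R * np.sum(us ** 2) + 0.5 * np.sum((xs[-1] - GOAL) ** 2)


def pontryagin_gradient(x0, us):
    """Gradient of J with respect to every action via a backward costate pass."""
    xs = rollout(x0, us)
    lam = xs[-1] - GOAL                    # costate at time T: gradient of the terminal cost
    grads = np.zeros_like(us)
    for t in reversed(range(len(us))):
        grads[t] = R * us[t] + B.T @ lam   # lam currently holds lambda_{t+1}
        lam = A.T @ lam                    # lambda_t (the running cost has no state term)
    return grads


def optimize(x0, T=20, steps=500, lr=0.5):
    """Plain gradient descent on the action sequence u_{0:T-1}."""
    us = np.zeros((T, 1))
    for _ in range(steps):
        us -= lr * pontryagin_gradient(x0, us)
    return us
```

Calling `optimize(np.zeros(2))` returns an array of shape `(20, 1)` holding the learned action sequence; no state feedback is used at execution time, which is the defining feature of the open-loop setting.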
