A Pontryagin Perspective on Reinforcement Learning

Onno Eberhard, Claire Vernade, Michael Muehlebach

TL;DR

This paper introduces open-loop reinforcement learning (OLRL), where the controller optimizes a fixed action sequence $u_{0:T-1}$ rather than a feedback policy, and provides convergence guarantees for three algorithms that operate with incomplete dynamics information. The authors derive a Pontryagin-based gradient framework, replacing Bellman-based dynamic programming, and present a model-based method with robustness guarantees, plus two model-free variants (on-trajectory and off-trajectory) that estimate the necessary Jacobians along trajectories via rollouts and online learning. Empirical results on an inverted pendulum and MuJoCo tasks demonstrate robust performance under model misspecification, high sample efficiency for the on-trajectory method, and strong convergence with the off-trajectory approach, sometimes rivaling closed-loop SAC baselines without function approximation. These findings highlight a viable alternative trajectory-optimization paradigm for RL in predictable dynamics, provide insights into the tradeoffs between model-based/model-free and on-/off-trajectory strategies, and suggest directions for integrating feedforward open-loop control with feedback mechanisms.

Abstract

Reinforcement learning has traditionally focused on learning state-dependent policies to solve optimal control problems in a closed-loop fashion. In this work, we introduce the paradigm of open-loop reinforcement learning where a fixed action sequence is learned instead. We present three new algorithms: one robust model-based method and two sample-efficient model-free methods. Rather than basing our algorithms on Bellman's equation from dynamic programming, our work builds on Pontryagin's principle from the theory of open-loop optimal control. We provide convergence guarantees and evaluate all methods empirically on a pendulum swing-up task, as well as on two high-dimensional MuJoCo tasks, significantly outperforming existing baselines.
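To make the open-loop idea concrete, the following is a minimal sketch of gradient ascent on a fixed action sequence, with the gradient computed by a backward costate recursion in the spirit of Pontryagin's principle. Everything specific here is an assumption for illustration and is not taken from the paper: the toy linear dynamics `A`, `B`, the quadratic rewards, the horizon `T`, the step size, and all function names. The dynamics Jacobians are known exactly in this toy setting, whereas the paper's algorithms work with approximate or estimated Jacobians.

```python
import numpy as np

# Hypothetical toy problem (not from the paper): linear dynamics, quadratic rewards.
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # df/dx
B = np.array([[0.0], [0.1]])             # df/du
x0 = np.array([1.0, 0.0])
T = 20

def f(x, u):
    return A @ x + B @ u

def reward(x, u):
    return -(x @ x) - 0.1 * (u @ u)

def terminal_reward(x):
    return -10.0 * (x @ x)

def rollout(us):
    """Apply the fixed action sequence from x0; returns states x_0..x_T."""
    xs = [x0]
    for u in us:
        xs.append(f(xs[-1], u))
    return np.array(xs)

def objective(us):
    xs = rollout(us)
    return sum(reward(x, u) for x, u in zip(xs[:-1], us)) + terminal_reward(xs[-1])

def pontryagin_gradient(us):
    """Gradient of the objective w.r.t. each action via a backward costate pass."""
    xs = rollout(us)
    lam = -20.0 * xs[-1]                      # costate at time T: gradient of terminal reward
    grad = np.zeros_like(us)
    for t in reversed(range(len(us))):
        grad[t] = -0.2 * us[t] + B.T @ lam    # dr/du + (df/du)^T lambda_{t+1}
        lam = -2.0 * xs[t] + A.T @ lam        # dr/dx + (df/dx)^T lambda_{t+1}
    return grad

# Open-loop learning: the decision variable is the action sequence itself, not a policy.
us = np.zeros((T, 1))
for _ in range(200):
    us = us + 0.01 * pontryagin_gradient(us)
```

The learned `us` is applied blindly, without observing states, which is why this style of control is only sensible when the dynamics are predictable.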

Paper Structure (29 sections, 4 theorems, 45 equations, 15 figures, 2 tables)

Key Result

theorem 1

Suppose Assumptions ass:reward, ass:error, and ass:smooth hold with constants $\gamma$, $\zeta$, and $L$. Let $\mu \doteq 1 - \gamma - \zeta - \gamma\zeta$ and $\nu \doteq 1 + \gamma + \zeta + \gamma\zeta$. If the step size $\eta$ is chosen small enough that $\alpha \doteq \mu - \frac{1}{2}\eta L \nu^2$ is positive, then […], where $J^\star \doteq \sup_{u\in \mathcal{U}^T}J(u)$ is the optimal value of the initial state.
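The step-size condition can be checked numerically. The sketch below only restates the definitions of $\mu$, $\nu$, and $\alpha$ given above; the function names and the sample values of $\gamma$, $\zeta$, and $L$ are hypothetical. Rearranging $\alpha > 0$ gives $\eta < 2\mu / (L\nu^2)$, and since $\mu + \nu = 2$ by definition, a positive $\mu$ requires $\nu < 2$, that is, the error constants must be small.

```python
def convergence_constants(gamma, zeta):
    """mu and nu exactly as defined in the theorem statement."""
    mu = 1.0 - gamma - zeta - gamma * zeta
    nu = 1.0 + gamma + zeta + gamma * zeta
    return mu, nu

def alpha(eta, gamma, zeta, L):
    """alpha = mu - (1/2) * eta * L * nu^2; the theorem requires alpha > 0."""
    mu, nu = convergence_constants(gamma, zeta)
    return mu - 0.5 * eta * L * nu ** 2

def max_step_size(gamma, zeta, L):
    """Supremum of step sizes with alpha > 0, or 0.0 if no step size qualifies."""
    mu, nu = convergence_constants(gamma, zeta)
    if mu <= 0.0:
        return 0.0
    return 2.0 * mu / (L * nu ** 2)

# Hypothetical constants, chosen only for illustration.
gamma, zeta, L = 0.1, 0.2, 5.0
eta_max = max_step_size(gamma, zeta, L)
print(f"any step size below {eta_max:.4f} keeps alpha positive")
```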

Figures (15)

  • Figure 1: Comparison of closed-loop (left) and open-loop (right) control. In closed-loop RL, the goal is to learn a policy $\pi$. In open-loop RL, a fixed sequence of actions $u_{0:T-1}$ is learned instead, with $u_t$ independent of the states $x_{0:t}$.
  • Figure 2: Open-loop reinforcement learning
  • Figure 3: (a) The Jacobians of $f$ (slope of the green linearization) at the reference point $(\bar{x}_t, \bar{u}_t)$ can be estimated from the transitions $\{(x_t^{(i)}, u_t^{(i)}, x_{t+1}^{(i)})\}_{i=1}^M$ of $M$ perturbed rollouts. (b) The Jacobians of subsequent trajectories (indexed by $k$) remain close. To estimate the Jacobian at iteration $k$, the most recent iterate ($k - 1$) is more relevant than older iterates.
  • Figure 4: The inverted pendulum swing-up task. The goal is to control the force $F$ such that the tip of the pendulum swings up above the base. The shown solution was found by the on-trajectory method of \ref{sec:on-policy}.
  • Figure 5: The model-based open-loop RL algorithm can solve the pendulum problem reliably even with a considerable model error.
  • ...and 10 more figures
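
The Jacobian estimation described in the caption of Figure 3(a) can be sketched as a least-squares fit over perturbed transitions. This is a simplified single-time-step illustration under stated assumptions, not the paper's procedure: the dynamics `f`, the noise scale `sigma`, the sample count `M`, and the function name `estimate_jacobians` are all hypothetical, and both state and action are perturbed directly around the reference point rather than arising from full perturbed rollouts.

```python
import numpy as np

def f(x, u):
    # Hypothetical pendulum-like dynamics, treated as a black box by the estimator.
    theta, omega = x
    return np.array([theta + 0.05 * omega,
                     omega + 0.05 * (-np.sin(theta) + u[0])])

def estimate_jacobians(x_bar, u_bar, M=50, sigma=1e-3, seed=0):
    """Fit x' - x_bar' ~ A (x - x_bar) + B (u - u_bar) from M perturbed transitions."""
    rng = np.random.default_rng(seed)
    n, m = x_bar.size, u_bar.size
    dX = sigma * rng.standard_normal((M, n))   # state perturbations
    dU = sigma * rng.standard_normal((M, m))   # action perturbations
    x_next_bar = f(x_bar, u_bar)
    dX_next = np.array([f(x_bar + dx, u_bar + du) - x_next_bar
                        for dx, du in zip(dX, dU)])
    Z = np.hstack([dX, dU])                    # regressors, shape (M, n + m)
    W, *_ = np.linalg.lstsq(Z, dX_next, rcond=None)
    J = W.T                                    # stacked Jacobian [A B], shape (n, n + m)
    return J[:, :n], J[:, n:]

x_bar = np.array([0.3, -0.2])
u_bar = np.array([0.1])
A_hat, B_hat = estimate_jacobians(x_bar, u_bar)
```

Smaller perturbations reduce the linearization error but, with noisy real transitions, would increase the variance of the fit, so `sigma` and `M` trade off against each other.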

Theorems & Definitions (8)

  • theorem 1
  • proof
  • lemma 1
  • proof
  • lemma 2
  • proof
  • theorem 2
  • proof