A view on learning robust goal-conditioned value functions: Interplay between RL and MPC

Nathan P. Lawrence, Philip D. Loewen, Michael G. Forbes, R. Bhushan Gopaluni, Ali Mesbah

TL;DR

The paper proposes a value-function-centric framework that unifies reinforcement learning and model predictive control for decision-making under uncertainty. It treats RL as learning a global value function offline, and MPC as online construction of a local value function for constraint-safe planning, linking the two through a robust, goal-conditioned, scenario-based approach. Key contributions include a tutorial-style synthesis of RL and MPC, a robust offline training scheme using scenario trees, and a hybrid RL+MPC architecture where a learned terminal value guides short-horizon MPC with safety guarantees. The approach is validated through classical and robust control benchmarks, illustrating that the combined method leverages RL's long-horizon guidance with MPC's safety and constraint satisfaction, offering scalable, robust performance for complex control tasks.

Abstract

Reinforcement learning (RL) and model predictive control (MPC) offer a wealth of distinct approaches for automatic decision-making under uncertainty. Given the impact both fields have had independently across numerous domains, there is growing interest in combining the general-purpose learning capability of RL with the safety and robustness features of MPC. To this end, this paper presents a tutorial-style treatment of RL and MPC, treating them as alternative approaches to solving Markov decision processes. In our formulation, RL aims to learn a global value function through offline exploration in an uncertain environment, whereas MPC constructs a local value function through online optimization. This local-global perspective suggests new ways to design policies that combine robustness and goal-conditioned learning. Robustness is incorporated into the RL and MPC pipelines through a scenario-based approach. Goal-conditioned learning aims to alleviate the burden of engineering a reward function for RL. Combining the two leads to a single policy that unites a robust, high-level RL terminal value function with short-term, scenario-based MPC planning for reliable constraint satisfaction. This approach leverages the benefits of both RL and MPC, the effectiveness of which is demonstrated on classical control benchmarks.
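The combined policy described above can be sketched in a few lines of Python: a short-horizon planner evaluates each candidate action plan under several scenarios and adds a terminal value at the end of the horizon. This is only an illustrative sketch under stated assumptions, not the paper's implementation. The scalar model `f`, the uncertainty set `SCENARIOS`, the quadratic goal-conditioned `reward`, the state bound `X_MAX`, the discount `gamma`, the coarse action grid, and the worst-case aggregation over scenarios are all hypothetical choices. The hand-written `terminal_value` merely stands in for a learned RL critic.

```python
import itertools
import numpy as np

# Hypothetical uncertainty set: three candidate values of an unknown gain.
SCENARIOS = (0.9, 1.0, 1.1)
ACTIONS = np.linspace(-1.0, 1.0, 5)  # coarse action grid, for illustration only
X_MAX = 3.0                          # hypothetical state constraint |x| <= X_MAX


def f(x, u, psi):
    """Toy scalar model x_next = psi * x + u (an assumption, not the paper's)."""
    return psi * x + u


def reward(x, u, goal):
    """Goal-conditioned stage reward: penalize distance to goal and effort."""
    return -(x - goal) ** 2 - 0.1 * u ** 2


def terminal_value(x, goal):
    """Stand-in for a learned goal-conditioned value function (the RL critic)."""
    return -5.0 * (x - goal) ** 2


def scenario_return(x, plan, psi, goal, gamma=0.99):
    """Discounted return of one action plan under one scenario.

    Returns -inf if the state constraint is violated along the way.
    """
    total = 0.0
    for k, u in enumerate(plan):
        total += gamma ** k * reward(x, u, goal)
        x = f(x, u, psi)
        if abs(x) > X_MAX:
            return -np.inf
    return total + gamma ** len(plan) * terminal_value(x, goal)


def robust_mpc_action(x, goal, horizon=3):
    """Return the first action of the plan with the best worst-case return."""
    best_plan, best_score = None, -np.inf
    for plan in itertools.product(ACTIONS, repeat=horizon):
        # Worst case over scenarios: an assumed form of robust aggregation.
        score = min(scenario_return(x, plan, psi, goal) for psi in SCENARIOS)
        if score > best_score:
            best_plan, best_score = plan, score
    if best_plan is None:
        raise ValueError("no plan satisfies the constraint in every scenario")
    return float(best_plan[0])
```

Calling `robust_mpc_action(2.0, goal=0.0)` returns the first action of the best plan, which is then applied before re-planning at the next state, in the usual receding-horizon fashion. Exhaustive enumeration is used only to keep the sketch dependency-free; a practical MPC would use a numerical optimizer.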

Paper Structure

This paper contains 28 sections, 48 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: RL and MPC can be seen as alternative approaches to solving MDPs. However, they both leverage the idea of selecting actions by maximizing a value function $Q$. The RL agent learns a global value function offline, while the MPC agent constructs a local value function for online control. $\widehat{r}$ represents a tractable reward for the MPC agent, possibly different from the true reward signal $r$.
  • Figure 2: An actor-critic agent interacts with a branching simulation environment offline to learn a robust global value function. The critic is used in the usual fashion to inform parameter updates, but also to construct a robust local MPC agent for online control of the "true" system.
  • Figure 3: A scenario tree branches at some state $x$, applying the same control action $u$ for each of the three cases in the uncertainty set $\{\psi^1, \psi^2, \psi^3\}$. Three successive states are computed using the model $f$, after which each scenario remains constant. However, it is possible to keep branching at each node.
  • Figure 4: (Top) The goal-conditioned agent gives the most consistent performance in terms of time spent in the upright position. (Bottom) The expert agent is very efficient at solving the swing up task, whereas the quadratic agent is the most aggressive. The goal-conditioned agent becomes much more efficient with its actions as the prediction horizon increases.
  • Figure 5: Starting from rest, the goal-conditioned MPC agent is given a sequence of three different unstable equilibria to reach. A corresponding animation can be found here: https://github.com/NPLawrence/RL-MPC. This experiment was performed with $\sigma^2 = 0.5$ and $N=35$.
  • ...and 6 more figures
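
The scenario tree described in the Figure 3 caption can be illustrated with a small Python sketch. It contrasts branching once at the root, where each scenario keeps its value of $\psi$ afterwards, with branching at every node. The scalar model `f`, the function names, and the use of a single shared control sequence across all branches are assumptions made for illustration. They are not taken from the paper.

```python
from itertools import product


def f(x, u, psi):
    """Toy scalar model x_next = psi * x + u (hypothetical, for illustration)."""
    return psi * x + u


def branch_once(x, controls, psis):
    """Branch at the root only: each scenario keeps its psi for the whole path."""
    paths = []
    for psi in psis:
        state, path = x, [x]
        for u in controls:
            state = f(state, u, psi)
            path.append(state)
        paths.append(path)
    return paths


def branch_everywhere(x, controls, psis):
    """Branch at every node: one path per sequence of psi values."""
    paths = []
    for psi_seq in product(psis, repeat=len(controls)):
        state, path = x, [x]
        for u, psi in zip(controls, psi_seq):
            state = f(state, u, psi)
            path.append(state)
        paths.append(path)
    return paths
```

With three scenarios and a two-step control sequence, `branch_once` yields 3 paths while `branch_everywhere` yields 9. Branching only at the root keeps the scenario count from growing exponentially with the horizon.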