A view on learning robust goal-conditioned value functions: Interplay between RL and MPC
Nathan P. Lawrence, Philip D. Loewen, Michael G. Forbes, R. Bhushan Gopaluni, Ali Mesbah
TL;DR
The paper proposes a value-function-centric framework that unifies reinforcement learning (RL) and model predictive control (MPC) for decision-making under uncertainty. It treats RL as learning a global value function through offline exploration and MPC as constructing a local value function through online optimization, and links the two through a robust, goal-conditioned, scenario-based approach. Key contributions include a tutorial-style synthesis of RL and MPC as alternative approaches to solving Markov decision processes, a scenario-based treatment of robustness in both pipelines, and a hybrid RL+MPC policy in which a learned, robust terminal value function guides short-horizon, scenario-based MPC planning for reliable constraint satisfaction. The approach is demonstrated on classical control benchmarks, illustrating that the combined policy pairs RL's long-horizon guidance with MPC's constraint handling.
Abstract
Reinforcement learning (RL) and model predictive control (MPC) offer a wealth of distinct approaches for automatic decision-making under uncertainty. Given the impact both fields have had independently across numerous domains, there is growing interest in combining the general-purpose learning capability of RL with the safety and robustness features of MPC. To this end, this paper presents a tutorial-style treatment of RL and MPC, treating them as alternative approaches to solving Markov decision processes. In our formulation, RL aims to learn a global value function through offline exploration in an uncertain environment, whereas MPC constructs a local value function through online optimization. This local-global perspective suggests new ways to design policies that combine robustness and goal-conditioned learning. Robustness is incorporated into the RL and MPC pipelines through a scenario-based approach. Goal-conditioned learning aims to alleviate the burden of engineering a reward function for RL. Combining the two leads to a single policy that unites a robust, high-level RL terminal value function with short-term, scenario-based MPC planning for reliable constraint satisfaction. This approach leverages the benefits of both RL and MPC, the effectiveness of which is demonstrated on classical control benchmarks.
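To make the local-global idea concrete, the following is a minimal sketch of a short-horizon, scenario-based planner that uses a terminal value function, and it is not the authors' implementation. Everything in it is an assumption chosen for illustration: the scalar dynamics `step`, the uncertain gain values in `scenarios`, the discrete action set, the state bound `x_max`, and `terminal_value`, which is a hand-written stand-in for a learned goal-conditioned value function. For simplicity it optimizes a single action sequence against the worst scenario rather than building a scenario tree with recourse.

```python
import itertools
import math


def terminal_value(x, goal):
    # Hypothetical stand-in for a learned goal-conditioned value function:
    # higher is better, with the maximum (zero) attained at the goal.
    return -abs(x - goal)


def step(x, u, w):
    # Assumed scalar dynamics; w is an uncertain input gain.
    return x + w * u


def scenario_mpc(x0, goal, scenarios, actions=(-1.0, 0.0, 1.0),
                 horizon=3, x_max=5.0):
    """Return (first action, worst-case value) of the best feasible plan.

    A plan is feasible only if |x| <= x_max holds at every step of
    every scenario. Plans are ranked by their worst-case return, where
    the return is the sum of stage rewards plus the terminal value.
    """
    best_u, best_val = None, -math.inf
    for seq in itertools.product(actions, repeat=horizon):
        worst, feasible = math.inf, True
        for w in scenarios:
            x, ret = x0, 0.0
            for u in seq:
                x = step(x, u, w)
                if abs(x) > x_max:  # constraint violated in this scenario
                    feasible = False
                    break
                ret += -abs(x - goal)  # stage reward: negative distance to goal
            if not feasible:
                break
            ret += terminal_value(x, goal)  # long-horizon guidance at the horizon end
            worst = min(worst, ret)
        if feasible and worst > best_val:
            best_val, best_u = worst, seq[0]
    return best_u, best_val


# Receding-horizon use: apply only the first action, then re-plan.
u0, v0 = scenario_mpc(x0=0.0, goal=3.0, scenarios=(0.8, 1.0, 1.2))
```

In this sketch the terminal value plays the role of the global value function: it summarizes what happens beyond the planning horizon, while the enumeration over scenarios enforces the constraint for every sampled realization of the uncertainty. Exhaustive enumeration is only practical for small discrete action sets; a continuous-action problem would need a numerical optimizer instead.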
