Model Predictive Control and Reinforcement Learning: A Unified Framework Based on Dynamic Programming

Dimitri P. Bertsekas

TL;DR

This paper describes a new conceptual framework that connects approximate Dynamic Programming (DP), Model Predictive Control (MPC), and Reinforcement Learning (RL). The framework provides a vehicle for bridging the cultural gap between RL and MPC, and sheds new light on some fundamental issues in MPC.

Abstract

In this paper we describe a new conceptual framework that connects approximate Dynamic Programming (DP), Model Predictive Control (MPC), and Reinforcement Learning (RL). This framework centers around two algorithms, which are designed largely independently of each other and operate in synergy through the powerful mechanism of Newton's method. We call them the off-line training and the on-line play algorithms. The names are borrowed from some of the major successes of RL involving games; primary examples are the recent (2017) AlphaZero program (which plays chess, [SHS17], [SSS17]), and the similarly structured and earlier (1990s) TD-Gammon program (which plays backgammon, [Tes94], [Tes95], [TeG96]). In these game contexts, the off-line training algorithm is the method used to teach the program how to evaluate positions and to generate good moves at any given position, while the on-line play algorithm is the method used to play in real time against human or computer opponents. Significantly, the synergy between off-line training and on-line play also underlies MPC (as well as other major classes of sequential decision problems), and indeed the MPC design architecture is very similar to the one of AlphaZero and TD-Gammon. This conceptual insight provides a vehicle for bridging the cultural gap between RL and MPC, and sheds new light on some fundamental issues in MPC. These include the enhancement of stability properties through rollout, the treatment of uncertainty through the use of certainty equivalence, the resilience of MPC in adaptive control settings that involve changing system parameters, and the insights provided by the superlinear performance bounds implied by Newton's method.

Paper Structure (26 sections, 51 equations, 19 figures)

Introduction
An MPC Problem Formulation
Approximation in Value Space - MPC and RL
Rollout with a Stable Policy
Off-Line Training and On-line Play
AlphaZero and TD-Gammon
An Overview of our Framework
Off-Line Training and On-Line Play Synergy Through Newton's Method
The Riccati Equation
Iterative Solution by Value and Policy Iteration
Visualizing Approximation in Value Space
Region of Stability of Approximation in Value Space
Rollout and Policy Iteration
Truncated Rollout
Double Rollout
...and 11 more sections

Figures (19)

Figure 1: Illustration of approximation in value space with one-step lookahead.
Figure 2: Illustration of approximation in value space with $\ell$-step lookahead. The $\ell$-step minimization at $x_k$ yields a sequence $\tilde{u}_k,\tilde{u}_{k+1},\ldots,\tilde{u}_{k+\ell-1}$. The control $\tilde{u}_k$ is applied at $x_k$, and defines the $\ell$-step lookahead policy $\tilde{\mu}$ via $\tilde{\mu}(x_k)=\tilde{u}_k$. The controls $\tilde{u}_{k+1},\ldots,\tilde{u}_{k+\ell-1}$ are discarded. This is similar to mainstream MPC schemes.
Figure 3: Illustration of the architecture of AlphaZero chess. It uses a very long lookahead minimization involving moves and countermoves of the two players followed by a terminal position evaluator, which is designed through extensive off-line training using a deep neural network. There are many implementation details that we will not discuss here; for example the lookahead is selective, because some lookahead paths are pruned, by using a form of Monte Carlo tree search. Also a primitive form of rollout is used at the end of the lookahead minimization to resolve dynamic terminal positions. Note that the off-line-trained neural network of AlphaZero produces both a position evaluator and a playing policy. However, the neural network-trained policy is not used directly for on-line play.
Figure 4: Illustration of the architecture of TD-Gammon with truncated rollout [TeG96]. It uses a relatively short lookahead minimization followed by rollout and terminal position evaluation (i.e., game simulation with the one-step lookahead player; the game is truncated after a number of moves, with a position evaluation at the end). Note that backgammon involves stochastic uncertainty, and its state is the pair of current board position and dice roll.
Figure 5: Graphical solution of the Riccati equation. The optimal cost function is $J^*(x)=K^*x^2$. The scalar $K^*$ solves the fixed point equation $K=F(K)$. It can be found graphically as the positive value of $K$ that corresponds to the point where the graphs of the functions $K$ and $F(K)$ meet. A similar interpretation can be given for the solution of the general Bellman equation, which however cannot be visually depicted for problems involving more than one or two states; see the books [Ber20], [Ber22a], and [Ber22b].
...and 14 more figures
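
The fixed point equation $K=F(K)$ in the Figure 5 caption, and the Newton's method connection mentioned in the abstract, can be illustrated numerically. The sketch below is not taken from the paper. It assumes a scalar linear quadratic problem with system $x_{k+1}=ax_k+bu_k$ and stage cost $qx^2+ru^2$, for which the Riccati operator takes the standard form $F(K)=a^2rK/(r+b^2K)+q$. The parameter values $a=b=q=r=1$ and all function names are hypothetical choices for illustration. The code finds $K^*$ by value iteration, and then checks that the cost coefficient of the one-step lookahead policy obtained from a terminal cost $\tilde{K}x^2$ coincides with a single Newton step for solving $K=F(K)$ starting from $\tilde{K}$.

```python
# Assumed setting (not from the source): scalar system x_{k+1} = a*x_k + b*u_k
# with stage cost q*x^2 + r*u^2. All names and default values are illustrative.

def riccati_operator(K, a=1.0, b=1.0, q=1.0, r=1.0):
    """Scalar Riccati operator F(K) = a^2 r K / (r + b^2 K) + q."""
    return a * a * r * K / (r + b * b * K) + q


def value_iteration(K0=0.0, iters=50, a=1.0, b=1.0, q=1.0, r=1.0):
    """Approximate the fixed point K* by iterating K <- F(K)."""
    K = K0
    for _ in range(iters):
        K = riccati_operator(K, a, b, q, r)
    return K


def one_step_lookahead_cost(K_tilde, a=1.0, b=1.0, q=1.0, r=1.0):
    """Cost coefficient of the linear policy u = L*x that minimizes
    q*x^2 + r*u^2 + K_tilde*(a*x + b*u)^2 over u."""
    L = -a * b * K_tilde / (r + b * b * K_tilde)
    closed_loop = a + b * L
    if abs(closed_loop) >= 1.0:
        raise ValueError("lookahead policy is unstable for this K_tilde")
    # The policy cost K_L solves the linear equation
    # K_L = q + r*L^2 + (a + b*L)^2 * K_L.
    return (q + r * L * L) / (1.0 - closed_loop ** 2)


def newton_step(K_tilde, a=1.0, b=1.0, q=1.0, r=1.0):
    """One Newton iteration for K = F(K): linearize F at K_tilde and
    solve the resulting linear fixed point equation."""
    F = riccati_operator(K_tilde, a, b, q, r)
    dF = (a * r / (r + b * b * K_tilde)) ** 2  # derivative F'(K_tilde)
    return (F - dF * K_tilde) / (1.0 - dF)


if __name__ == "__main__":
    K_star = value_iteration()
    K_tilde = 1.0
    print("K* by value iteration:    ", K_star)
    print("one-step lookahead cost:  ", one_step_lookahead_cost(K_tilde))
    print("Newton step from K_tilde: ", newton_step(K_tilde))
```

Under these assumed parameters, $K^*$ is the positive root of $K^2-K-1=0$. Starting from $\tilde{K}=1$, the lookahead policy cost and the Newton step agree and land much closer to $K^*$ than $\tilde{K}$ itself, which is the scalar version of the off-line training and on-line play synergy described in the abstract.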
