Convergent NMPC-based Reinforcement Learning Using Deep Expected Sarsa and Nonlinear Temporal Difference Learning
Amine Salaje, Thomas Chevet, Nicolas Langlois
TL;DR
The paper tackles the tuning of nonlinear model predictive control (NMPC) parameters with reinforcement learning (RL) to improve constrained, nonlinear control. It proposes two methods: (i) an NMPC-based off-policy regularized deep Expected Sarsa (RDES), which uses an NMPC as the current action-value function and a neural network, whose input is augmented with the current parameter vector $\boldsymbol{\theta}$, to approximate the subsequent action-value function, reducing online computation; and (ii) an NMPC-based gradient Expected Sarsa (GES), which applies gradient temporal difference (GTD) updates that minimize the mean-squared projected Bellman error (MSPBE), so that learning converges despite the nonlinear NMPC function approximator. Results on a differential-drive robot show that RDES achieves faster, stable learning and matches the performance of GES, while GES provides formal convergence guarantees in the nonlinear function-approximation setting. Overall, the work demonstrates that integrating NMPC with GTD-based RL yields stable, efficient tuning of constrained nonlinear controllers, with practical implications for real-time autonomous systems.
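To make the Expected Sarsa ingredient concrete, the following minimal sketch computes a temporal difference error in which the subsequent action-value is an expectation over next actions. It is an illustration under stated assumptions, not the authors' implementation: the action set is taken to be finite, the next action-values are assumed to come from some surrogate (for example a neural network fed the state concatenated with $\boldsymbol{\theta}$), and the function names `augment_with_parameters` and `expected_sarsa_td_error` are hypothetical.

```python
from typing import List, Sequence


def augment_with_parameters(state: Sequence[float],
                            theta: Sequence[float]) -> List[float]:
    """Build a surrogate input by appending the parameter vector theta to the state.

    Assumption: plain concatenation; the source only says theta is added to the input.
    """
    return list(state) + list(theta)


def expected_sarsa_td_error(q_current: float,
                            reward: float,
                            next_action_probs: Sequence[float],
                            next_q_values: Sequence[float],
                            gamma: float = 0.99) -> float:
    """Return r + gamma * E_pi[Q(s', a')] - Q(s, a) for a finite action set.

    q_current         : Q(s, a), e.g. the value returned by the parametrized NMPC
    next_action_probs : pi(a' | s') for each candidate next action
    next_q_values     : surrogate estimates of Q(s', a') for the same actions
    """
    if len(next_action_probs) != len(next_q_values):
        raise ValueError("probabilities and values must have the same length")
    # Expectation over next actions replaces the single sampled action of Sarsa.
    expected_next_q = sum(p * q for p, q in zip(next_action_probs, next_q_values))
    return reward + gamma * expected_next_q - q_current
```

For instance, with `q_current=1.0`, `reward=0.5`, two equiprobable next actions valued `2.0` and `4.0`, and `gamma=0.9`, the error is approximately 2.2.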
Abstract
In this paper, we present a learning-based nonlinear model predictive controller (NMPC) that uses an original reinforcement learning (RL) method to learn the optimal weights of the NMPC scheme. Two methods are proposed. First, the controller is used as the current action-value function of a deep Expected Sarsa in which the subsequent action-value function, usually obtained with a secondary NMPC, is approximated with a neural network (NN). Compared with existing methods, we add the current value of the NMPC's learned parameters to the NN's input so that the network is able to approximate the action-value function and stabilize the learning performance. Additionally, with the use of the NN, the real-time computational burden is approximately halved without affecting the closed-loop performance. Second, we combine gradient temporal difference methods with a parametrized NMPC as the function approximator of the Expected Sarsa RL method to overcome the potential parameter divergence and instability issues that arise when nonlinearities are present in the function approximation. The simulation results show that the proposed approach converges to a locally optimal solution without instability problems.
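As a point of reference for the gradient temporal difference idea, the sketch below implements one step of the well-known TDC (TD with gradient correction) update for a linear approximator. This is a deliberately simplified stand-in: the paper's setting uses a nonlinear, NMPC-based approximator, and nonlinear GTD variants carry additional correction terms that are omitted here. The name `tdc_step`, the step sizes, and the use of an expected next feature vector (to mimic the Expected Sarsa target) are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def _dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def tdc_step(weights: Sequence[float],
             aux: Sequence[float],
             phi: Sequence[float],
             phi_next_expected: Sequence[float],
             reward: float,
             gamma: float = 0.99,
             alpha: float = 0.01,
             beta: float = 0.05) -> Tuple[List[float], List[float], float]:
    """One linear TDC update; returns (new_weights, new_aux, td_error).

    weights           : primary weights of the linear value estimate
    aux               : secondary weights estimating the expected TD error
    phi               : features of the current state-action pair
    phi_next_expected : next features averaged over the policy (assumption)
    """
    # TD error with the expected next value as the bootstrap target.
    delta = reward + gamma * _dot(weights, phi_next_expected) - _dot(weights, phi)
    phi_dot_aux = _dot(phi, aux)
    # Primary update: TD step plus the gradient-correction term.
    new_weights = [w + alpha * (delta * f - gamma * fn * phi_dot_aux)
                   for w, f, fn in zip(weights, phi, phi_next_expected)]
    # Secondary update: least-mean-squares regression of delta onto phi.
    new_aux = [a + beta * (delta - phi_dot_aux) * f for a, f in zip(aux, phi)]
    return new_weights, new_aux, delta
```

The two step sizes reflect the usual two-timescale design, with the secondary weights typically updated faster than the primary ones; the values above are placeholders, not tuned settings from the paper.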
