Actor-Critic Model Predictive Control: Differentiable Optimization meets Reinforcement Learning for Agile Flight

Angel Romero, Elie Aljalbout, Yunlong Song, Davide Scaramuzza

TL;DR

AC-MPC integrates a differentiable MPC as the actor within an actor-critic RL framework to combine online replanning with long-horizon learning for agile quadrotor flight. A neural cost map learns quadratic MPC costs, enabling the short-horizon MPC to be guided by task-relevant objectives while the critic handles long-term value estimation. The paper introduces Model-Predictive Value Expansion to reuse MPC predictions for critic training, and demonstrates superior robustness, sample efficiency, and sim-to-real transfer, achieving superhuman performance on drone racing tasks. Analyses reveal a strong link between the critic's Hessian and the MPC cost terms, offering mechanistic insight into the RL-MPC interplay. While effective, the approach operates within differentiable MPC constraints (input-only constraints) and highlights the need for scalable, differentiable solvers for broader applications.
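
As a rough illustration of reusing MPC predictions for critic training, the sketch below computes a generic H-step value-expansion target: discounted rewards along a predicted trajectory plus a bootstrapped critic value at the final predicted state. This is the standard model-based value expansion form and is an assumption here, since the exact definition of Model-Predictive Value Expansion is not given in this summary. The function name, the `gamma` default, and the inputs are hypothetical.

```python
import numpy as np


def value_expansion_target(rewards, terminal_value, gamma=0.99):
    """Generic H-step value-expansion target (illustrative, not the paper's code).

    rewards        : rewards r_0..r_{H-1} evaluated along a predicted trajectory
                     (assumed here to come from the MPC's state predictions)
    terminal_value : critic estimate V(s_H) at the last predicted state
    gamma          : discount factor (assumed value)
    """
    rewards = np.asarray(rewards, dtype=float)
    horizon = rewards.shape[0]
    # Discount weights gamma^0, gamma^1, ..., gamma^(H-1)
    discounts = gamma ** np.arange(horizon)
    # sum_k gamma^k r_k + gamma^H V(s_H)
    return float(discounts @ rewards + gamma ** horizon * terminal_value)


# Two predicted steps with reward 1.0 each, critic value 10.0 at the end
target = value_expansion_target([1.0, 1.0], terminal_value=10.0, gamma=0.5)
```

With an empty reward sequence the target reduces to the critic value itself, so the same function covers the plain bootstrapped case.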

Abstract

A key open challenge in agile quadrotor flight is how to combine the flexibility and task-level generality of model-free reinforcement learning (RL) with the structure and online replanning capabilities of model predictive control (MPC), aiming to leverage their complementary strengths in dynamic and uncertain environments. This paper provides an answer by introducing a new framework called Actor-Critic Model Predictive Control. The key idea is to embed a differentiable MPC within an actor-critic RL framework. This integration allows for short-term predictive optimization of control actions through MPC, while leveraging RL for end-to-end learning and exploration over longer horizons. Through various ablation studies, conducted in the context of agile quadrotor racing, we expose the benefits of the proposed approach: it achieves better out-of-distribution behavior, better robustness to changes in the quadrotor's dynamics, and improved sample efficiency. Additionally, we conduct an empirical analysis using a quadrotor platform that reveals a relationship between the critic's learned value function and the cost function of the differentiable MPC, providing a deeper understanding of the interplay between the critic's value and the MPC cost functions. Finally, we validate our method in a drone racing task on different tracks, in both simulation and the real world. Our method achieves the same superhuman performance as state-of-the-art model-free RL, showcasing speeds of up to 21 m/s. We show that the proposed architecture can achieve real-time control performance, learn complex behaviors via trial and error, and retain the predictive properties of the MPC to better handle out-of-distribution behavior.

Paper Structure

This paper contains 27 sections, 17 equations, 13 figures, 5 tables, 2 algorithms.

Figures (13)

  • Figure 1: Top: A block diagram of an actor-critic reinforcement learning architecture with a Multilayer Perceptron (MLP). Bottom: A block diagram of the proposed approach. We combine the strength of actor-critic RL and the robustness of MPC by placing a differentiable MPC as the last module of the actor policy. At deployment time, the commands for the environment are obtained by solving an MPC problem, which leverages the system's dynamics and finds the optimal solution given the current state. We show that the proposed approach achieves better out-of-distribution behavior and better robustness to changes in the dynamics. We also show that the predictions of the differentiable MPC can be used to improve the learning of the value function and the sample efficiency.
  • Figure 2: Actor-Critic Model Predictive Control (AC-MPC) applied to agile quadrotor flight: velocity profiles and corresponding value function plots. The left side illustrates horizontal flight, while the right side shows vertical flight. In the value function plots, areas with high values (depicted in yellow) indicate regions with the highest expected returns. The MPC predictions are shown as black Xs.
  • Figure 3: Evolution of the collective thrust command over time for both AC-MLP and AC-MPC during trials on the SplitS track. The figure demonstrates AC-MPC's ability to effectively utilize control input saturation due to its access to the system dynamics and control constraints. In contrast, AC-MLP exhibits less consistent saturation behavior.
  • Figure 4: Reward evolution for agile quadrotor flight on the SplitS track. Three different cost map representations are shown: Diagonal, Cholesky and Full Matrix.
  • Figure 5: Reward evolution for drone racing on three different tracks, for AC-MLP (for different N values of the horizon, $N = 2$ and $N = 5$) and AC-MPC. The values were obtained by optimizing the initial exploration standard deviation for each approach independently and then selecting the best result from each. On the Vertical and SplitS tracks, AC-MPC leverages its prior knowledge of the system and shows improved learning. On the Horizontal track, which is simpler, the benefit of prior knowledge is less pronounced.
  • ...and 8 more figures
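
The three cost map representations named in Figure 4 (Diagonal, Cholesky, Full Matrix) can be read as three ways of turning a network's raw output vector into a quadratic cost matrix Q. The sketch below shows one plausible construction of each. It is an assumption for illustration only: the function names, the softplus positivity transform, and the symmetrization of the full matrix are hypothetical choices, and the paper's actual parameterizations may differ.

```python
import numpy as np


def softplus(x):
    # Numerically stable log(1 + exp(x)); maps any real number to a positive one
    return np.logaddexp(0.0, x)


def diagonal_cost(params):
    """n raw outputs -> diagonal Q with strictly positive entries."""
    return np.diag(softplus(np.asarray(params, dtype=float)))


def cholesky_cost(params, n):
    """n(n+1)/2 raw outputs -> Q = L L^T, positive definite by construction."""
    L = np.zeros((n, n))
    L[np.tril_indices(n)] = params      # fill the lower triangle row by row
    d = np.diag_indices(n)
    L[d] = softplus(L[d])               # force a positive diagonal
    return L @ L.T


def full_cost(params, n):
    """n*n raw outputs -> symmetric Q; NOT guaranteed positive semidefinite."""
    A = np.asarray(params, dtype=float).reshape(n, n)
    return 0.5 * (A + A.T)


Q_diag = diagonal_cost(np.zeros(2))
Q_chol = cholesky_cost(np.array([0.0, 1.0, 0.0, -1.0, 2.0, 0.0]), n=3)
Q_full = full_cost([0.0, 2.0, 0.0, 0.0], n=2)
```

Under these assumptions, the diagonal form uses the fewest parameters but cannot express cross terms, the Cholesky form captures cross terms while keeping Q positive definite, and the unconstrained full form can yield an indefinite Q (as `Q_full` does here), which a quadratic MPC solver may not handle well.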