Hierarchical Reinforcement Learning with Low-Level MPC for Multi-Agent Control
Max Studt, Georg Schildbach
TL;DR
This work tackles safe, coordinated multi-agent control under constraints by pairing high-level reinforcement learning (RL) with a low-level model predictive control (MPC) layer. The authors introduce ROI-guided hierarchical RL: each agent's policy selects a discrete target region and outputs a continuous point $\tau_t^i$ inside that region of interest (ROI), $\mathcal{T}=\{\tau\in\mathbb{R}^d:\|\tau-p_{c,\mathrm{target}}\|\le r_{\mathrm{ROI}}\}$, where $p_{c,\mathrm{target}}$ is the center of the selected region and $r_{\mathrm{ROI}}$ its radius. A decentralized MPC then tracks this point subject to dynamics and safety constraints by solving the finite-horizon problem $\min_{\{u_{i,k},\xi_{i,k}^{\mathrm{sep}},\xi_{i,k,w}^{\mathrm{obs}}\}} \sum_{k=0}^{N-1}\bigl[(p_{i,k}-\tau_t^i)^\top Q (p_{i,k}-\tau_t^i) + u_{i,k}^\top R u_{i,k}\bigr]+\text{slack penalties}$. On a predator–prey MARL benchmark, the approach outperforms end-to-end and shielding-based baselines in reward, safety, and consistency, and shows improved sample efficiency, faster convergence, and robustness to ROI variations. The results highlight the value of combining structured learning with certified low-level control for safe, generalizable multi-agent systems.
Abstract
Achieving safe and coordinated behavior in dynamic, constraint-rich environments remains a major challenge for learning-based control. Pure end-to-end learning often suffers from poor sample efficiency and limited reliability, while model-based methods depend on predefined references and struggle to generalize. We propose a hierarchical framework that combines tactical decision-making via reinforcement learning (RL) with low-level execution through Model Predictive Control (MPC). In multi-agent systems, high-level policies select abstract targets from structured regions of interest (ROIs), while MPC ensures dynamically feasible and safe motion. Tested on a predator–prey benchmark, our approach outperforms end-to-end and shielding-based RL baselines in terms of reward, safety, and consistency, underscoring the benefits of combining structured learning with model-based control.
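To make the interface between the two layers concrete, the following is a minimal Python sketch of how a high-level action could be turned into an ROI target and how the quadratic tracking cost could be evaluated. It is an illustration under stated assumptions, not the authors' implementation: the function names `roi_target` and `tracking_cost`, the rescaling of the raw policy output onto the ball, and all numeric values are hypothetical. The sketch only evaluates the cost for a given trajectory; the slack penalties and the constrained optimization itself are omitted.

```python
import numpy as np


def roi_target(center, raw_offset, r_roi):
    """Return a target inside the ball ROI of radius r_roi around center.

    Assumption: the policy emits an unconstrained offset, which is rescaled
    onto the ball whenever its norm exceeds r_roi.
    """
    center = np.asarray(center, dtype=float)
    offset = np.asarray(raw_offset, dtype=float)
    norm = np.linalg.norm(offset)
    if norm > r_roi:
        offset = offset * (r_roi / norm)
    return center + offset


def tracking_cost(positions, inputs, target, Q, R):
    """Sum of (p_k - tau)^T Q (p_k - tau) + u_k^T R u_k over k = 0..N-1.

    Slack penalties are intentionally left out of this sketch.
    """
    cost = 0.0
    for p, u in zip(positions, inputs):
        e = np.asarray(p, dtype=float) - target
        u = np.asarray(u, dtype=float)
        cost += e @ Q @ e + u @ R @ u
    return float(cost)


# Hypothetical example: two candidate regions in the plane, region 1 selected.
centers = np.array([[0.0, 0.0], [4.0, 0.0]])
region_index = 1
tau = roi_target(centers[region_index], raw_offset=[3.0, 4.0], r_roi=1.0)

# A hand-written two-step trajectory, standing in for an MPC prediction.
positions = [tau, tau + np.array([1.0, 0.0])]
inputs = [[1.0, 0.0], [0.0, 0.0]]
cost = tracking_cost(positions, inputs, tau, Q=np.eye(2), R=0.5 * np.eye(2))
```

In a full pipeline, a numerical solver would minimize this cost over the inputs $u_{i,k}$ and slack variables subject to the dynamics and safety constraints; here the trajectory is fixed by hand so that the cost structure is easy to inspect.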
