Hierarchical Reinforcement Learning with Low-Level MPC for Multi-Agent Control

Max Studt, Georg Schildbach

TL;DR

This work tackles safe, coordinated multi-agent control under constraints by integrating high-level reinforcement learning (RL) with a low-level model predictive control (MPC) layer. The authors introduce ROI-guided hierarchical RL, in which each agent's policy selects a discrete target region and outputs a continuous point within that region of interest (ROI), $\mathcal{T}=\{\tau\in\mathbb{R}^d:\|\tau-p_{c,\mathrm{target}}\|\le r_{\mathrm{ROI}}\}$. A decentralized MPC then tracks the selected point $\tau_t^i$ under dynamics and safety constraints by solving the finite-horizon problem $\min_{\{u_{i,k},\xi_{i,k}^{\mathrm{sep}},\xi_{i,k,w}^{\mathrm{obs}}\}} \sum_{k=0}^{N-1}\bigl[(p_{i,k}-\tau_t^i)^\top Q (p_{i,k}-\tau_t^i) + u_{i,k}^\top R u_{i,k}\bigr]+\text{slack penalties}$. Applied to a predator–prey MARL benchmark, the approach outperforms end-to-end and shielding-based baselines in reward, safety, and consistency, and demonstrates improved sample efficiency, faster convergence, and robustness to ROI variations. The results highlight the value of combining structured learning with constraint-aware low-level control for safe, generalizable multi-agent systems.
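
The ROI set and the tracking part of the MPC objective can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `project_to_roi` and `tracking_cost` are hypothetical, the summary does not say how a policy output is kept inside the ROI (Euclidean projection onto the ball is assumed here), and the slack penalties are omitted because their form is not given.

```python
import numpy as np


def project_to_roi(tau, center, r_roi):
    """Map a point onto the ball ||tau - center|| <= r_roi.

    Assumption: Euclidean projection is used to enforce ROI membership;
    the source only defines the set, not how membership is enforced.
    """
    tau = np.asarray(tau, dtype=float)
    center = np.asarray(center, dtype=float)
    offset = tau - center
    dist = np.linalg.norm(offset)
    if dist <= r_roi:
        return tau  # already inside the ROI
    return center + offset * (r_roi / dist)  # scale back to the boundary


def tracking_cost(positions, inputs, tau, Q, R):
    """Evaluate sum_k (p_k - tau)^T Q (p_k - tau) + u_k^T R u_k.

    Slack penalties from the full objective are intentionally left out.
    """
    tau = np.asarray(tau, dtype=float)
    cost = 0.0
    for p, u in zip(positions, inputs):
        e = np.asarray(p, dtype=float) - tau  # position error at step k
        u = np.asarray(u, dtype=float)
        cost += e @ Q @ e + u @ R @ u
    return float(cost)


if __name__ == "__main__":
    # Hypothetical 2-D example: ROI of radius 1 around the origin.
    target = project_to_roi([3.0, 0.0], [0.0, 0.0], 1.0)
    Q, R = np.eye(2), 0.5 * np.eye(2)
    print(target, tracking_cost([[1.0, 0.0]], [[1.0, 0.0]], target, Q, R))
```

In a full controller, a cost of this form would be minimized over the input sequence subject to the dynamics, separation, and obstacle constraints; the sketch only evaluates it for a given trajectory.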

Abstract

Achieving safe and coordinated behavior in dynamic, constraint-rich environments remains a major challenge for learning-based control. Pure end-to-end learning often suffers from poor sample efficiency and limited reliability, while model-based methods depend on predefined references and struggle to generalize. We propose a hierarchical framework that combines tactical decision-making via reinforcement learning (RL) with low-level execution through Model Predictive Control (MPC). For the case of multi-agent systems this means that high-level policies select abstract targets from structured regions of interest (ROIs), while MPC ensures dynamically feasible and safe motion. Tested on a predator-prey benchmark, our approach outperforms end-to-end and shielding-based RL baselines in terms of reward, safety, and consistency, underscoring the benefits of combining structured learning with model-based control.

Paper Structure

This paper contains 10 sections, 13 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of our hierarchical decision-making and control framework. The high-level MARL policy selects discrete targets from a structured ROI, informed by task-specific prior knowledge. A low-level MPC controller then tracks the selected target under dynamics and safety constraints.
  • Figure 2: Overview of the RL–MPC hierarchical architecture. The RL policy outputs reference positions for the MPC to track. The framework also permits learning other MPC elements (e.g., cost parameters, constraint margins, or models), enabling adaptive control across multiple layers.
  • Figure 4: Overview of our hierarchical CTDE architecture. During training (red), a centralized critic computes an advantage estimate $\hat{A}_t^i$ for each agent $i$ based on the global state and the agents’ actions. This estimate is used to update the decentralized actor policy. During execution (blue), each actor receives only its local observation $o_t^i$ and produces tactic parameters, which are passed to a low-level MPC controller that generates continuous control actions $u_{i,0}$. The environment returns observations and reward signals based on all agents’ actions and states.
  • Figure 5: Training reward curves for the three MARL schemes, ROI-guided learning, End-to-End approach, and Shielding MPC, across the three evaluation layouts: (a) Layout 1: no obstacles, (b) Layout 2: obstacles only, and (c) Layout 3: with obstacles and episode termination upon any predator-predator or predator-obstacle collision. Each subplot shows the evolution of the episode reward over 20 million training steps.
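
The per-agent advantage estimate $\hat{A}_t^i$ mentioned in the Figure 4 caption can be sketched as follows. The captions do not state which estimator the centralized critic uses; generalized advantage estimation (GAE) is assumed here purely for illustration, and the function name `gae_advantages` and the default values of `gamma` and `lam` are hypothetical.

```python
import numpy as np


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one agent over one trajectory.

    rewards: length-T sequence of rewards r_t.
    values:  length-(T+1) sequence of critic values V(s_t), where the
             last entry is the bootstrap value for the final state.
    Assumption: GAE is a stand-in; the source does not name the estimator.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    horizon = len(rewards)
    advantages = np.zeros(horizon)
    running = 0.0
    # Accumulate discounted TD residuals backwards in time.
    for t in reversed(range(horizon)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    # Hypothetical three-step trajectory with a zero-valued critic.
    print(gae_advantages([1.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]))
```

In a CTDE setup, the values would come from a critic that sees the global state during training, while the resulting advantages update an actor that only sees its local observation $o_t^i$.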