Hierarchical Policy-Gradient Reinforcement Learning for Multi-Agent Shepherding Control of Non-Cohesive Targets

Stefano Covone, Italo Napolitano, Francesco De Lellis, Mario di Bernardo

TL;DR

This work tackles shepherding non-cohesive targets with multiple decentralized herders by introducing a hierarchical policy-gradient framework based on PPO and MAPPO. It learns both driving and target-selection policies in a fully model-free setting with continuous actions, training the driving component in a single-herder/single-target scenario and the target-selection component in multi-agent contexts. The approach demonstrates improved settling times and path efficiency over a model-based baseline, scales to larger target sets using topological sensing, and remains robust under parameter variations. The results have practical implications for real-world multi-robot shepherding and indirect-control problems, with future work targeting truly large-scale systems, heterogeneous agents, and physical-robot validation.

Abstract

We propose a decentralized reinforcement learning solution for multi-agent shepherding of non-cohesive targets using policy-gradient methods. Our architecture integrates target-selection with target-driving through Proximal Policy Optimization, overcoming discrete-action constraints of previous Deep Q-Network approaches and enabling smoother agent trajectories. This model-free framework effectively solves the shepherding problem without prior dynamics knowledge. Experiments demonstrate our method's effectiveness and scalability with increased target numbers and limited sensing capabilities.

Paper Structure

This paper contains 11 sections, 8 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Two-layer feedback control scheme: each herder $\mathbf{H}_{i,j}$ detects the other agents' positions and uses the target-selection policy to choose the target $\mathbf{T}_{i,j}^*$ to control; the driving policy then steers that target by outputting the herder's velocity $\mathbf{u}$.
  • Figure 2: Learning curves during training: (a) driving policy cumulative reward, smoothed with a moving average over 200 samples; (b) target-selection policy cumulative reward, smoothed with a moving average over 2000 samples. For both policies, only the first half of training is shown to highlight the learning phase.
  • Figure 3: Example of the learned driving policy in a single herder, single target setting: the herder (blue diamond) approaches the target (magenta circle), drives it to the goal region (green circle) and contains it. Big markers show initial positions, small ones show final positions.
  • Figure 4: Validation example of the fully learning-based solution in the $N=2, M=5$ scenario: the radii of the herders (blue lines) and targets (magenta lines) are shown, compared to the goal region radius $\rho_\mathrm{G} = 5$ (green dashed line). The herders successfully steer and contain the targets in the goal region.
  • Figure 5: Validation results for the learning-based strategies (blue) over 1000 episodes with seeded initial conditions, showing the average settling time $n^\star$ in steps and the average path length $d$ in meters, compared against the heuristic approach (orange) from lamaShepherdingControlHerdability2024 for the (a) $N=1,\ M=1$ and (b) $N=2,\ M=5$ configurations. A robustness analysis varies the targets' model parameters by 30% around their nominal values for the (c) $N=1,\ M=1$ and (d) $N=2,\ M=5$ settings. Box plots are shown for each metric. A Mann-Whitney U test was performed on each metric pair, always yielding $p$-values smaller than $0.001$.
  • ...and 1 more figure
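
The two-layer scheme described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: in the paper both layers are learned policies (PPO and MAPPO), whereas here each is replaced by a hand-written stand-in so that the control flow is visible. The function names, the goal region centred at the origin, the nearest-target rule, and the `max_speed` and `offset` parameters are all assumptions made for illustration; only the goal radius $\rho_\mathrm{G} = 5$ comes from the Figure 4 caption.

```python
import math

GOAL_RADIUS = 5.0  # rho_G from the Figure 4 caption; goal centred at the origin is an assumption


def select_target(herder, targets):
    """Stand-in for the learned target-selection policy (hypothetical rule).

    Returns the index of the target outside the goal region that is closest
    to this herder, or None if every target is already inside the goal.
    """
    outside = [i for i, t in enumerate(targets) if math.hypot(t[0], t[1]) > GOAL_RADIUS]
    if not outside:
        return None
    return min(
        outside,
        key=lambda i: math.hypot(targets[i][0] - herder[0], targets[i][1] - herder[1]),
    )


def drive(herder, target, max_speed=1.0, offset=1.0):
    """Stand-in for the learned driving policy (hypothetical rule).

    Aims at a point `offset` units behind the target, as seen from the goal
    centre, and returns a continuous velocity u with speed at most max_speed.
    """
    r = math.hypot(target[0], target[1])
    if r == 0.0:
        aim = target
    else:
        aim = (target[0] * (1.0 + offset / r), target[1] * (1.0 + offset / r))
    dx, dy = aim[0] - herder[0], aim[1] - herder[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return (0.0, 0.0)
    scale = min(max_speed, dist) / dist
    return (dx * scale, dy * scale)


def herder_step(herder, targets):
    """One decentralized control step: select a target, then compute u."""
    idx = select_target(herder, targets)
    if idx is None:
        return None, (0.0, 0.0)  # nothing left to herd; stay still
    return idx, drive(herder, targets[idx])


if __name__ == "__main__":
    idx, u = herder_step((0.0, 0.0), [(3.0, 0.0), (8.0, 0.0), (0.0, 12.0)])
    print(idx, u)
```

Each herder would call `herder_step` independently with its own observation, which mirrors the decentralized structure; swapping either stand-in for a trained policy network would leave the outer loop unchanged.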