Decentralized Shepherding of Non-Cohesive Swarms Through Cluttered Environments via Deep Reinforcement Learning

Cristiana Punzo, Italo Napolitano, Cinzia Tomaselli, Mario di Bernardo

TL;DR

This paper tackles decentralized shepherding of non-cohesive targets in cluttered environments using a two-layer hierarchical approach. The low-level driving policy is learned with Proximal Policy Optimization (PPO) in a minimal setting with one herder, one target, and one obstacle (1H-1T), and is then deployed in larger multi-agent scenarios without retraining, guided by a decentralized high-level target assignment rule. In the reported simulations, the method yields collision-free trajectories and consistent convergence to the circular goal region $\Omega_G$, outperforming a vortex-based heuristic in 1H-1T tests and scaling to 10 herders and 100 targets (10H-100T) with three obstacles. The results point to a scalable, model-free framework for indirect control in complex domains; safety guarantees and perception-based sensing are left to future work.

Abstract

This paper investigates decentralized shepherding in cluttered environments, where a limited number of herders must guide a larger group of non-cohesive, diffusive targets toward a goal region in the presence of static obstacles. A hierarchical control architecture is proposed, integrating a high-level target assignment rule, where each herder is paired with a selected target, with a learning-based low-level driving module that enables effective steering of the assigned target. The low-level policy is trained in a one-herder-one-target scenario with a rectangular obstacle using Proximal Policy Optimization and then directly extended to multi-agent settings with multiple obstacles without requiring retraining. Numerical simulations demonstrate smooth, collision-free trajectories and consistent convergence to the goal region, highlighting the potential of reinforcement learning for scalable, model-free shepherding in complex environments.
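
To make the two-layer structure concrete, the following is a minimal Python sketch of one decentralized control step. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the target-selection rule, so `select_target` uses a hypothetical "nearest target still outside the goal" heuristic, and `placeholder_policy` is a hand-written stand-in for the learned PPO driving policy that ignores obstacles entirely. All function and parameter names are invented for this example.

```python
import numpy as np

def select_target(herder_pos, target_pos, goal_center, goal_radius):
    """Hypothetical high-level rule (an assumption, not the paper's rule):
    among targets still outside the goal region, pick the one closest
    to this herder. Returns a target index, or None if all are inside."""
    dist_to_goal = np.linalg.norm(target_pos - goal_center, axis=1)
    outside = np.flatnonzero(dist_to_goal > goal_radius)
    if outside.size == 0:
        return None
    dist_to_herder = np.linalg.norm(target_pos[outside] - herder_pos, axis=1)
    return int(outside[np.argmin(dist_to_herder)])

def placeholder_policy(herder_pos, target_pos, goal_center, max_speed=1.0, offset=1.0):
    """Stand-in for the learned driving policy (obstacles are ignored here).
    Steers the herder toward a point `offset` units behind the target,
    on the side opposite to the goal center."""
    away = target_pos - goal_center
    norm = np.linalg.norm(away)
    if norm == 0.0:
        return np.zeros_like(herder_pos)
    behind = target_pos + offset * away / norm
    step = behind - herder_pos
    step_norm = np.linalg.norm(step)
    if step_norm == 0.0:
        return np.zeros_like(herder_pos)
    return max_speed * step / step_norm

def herder_commands(herders, targets, goal_center, goal_radius, policy=placeholder_policy):
    """One decentralized step: each herder selects its own target and
    computes its own velocity command, without a central coordinator."""
    herders = np.asarray(herders, dtype=float)
    targets = np.asarray(targets, dtype=float)
    goal_center = np.asarray(goal_center, dtype=float)
    commands = np.zeros_like(herders)
    for i, h in enumerate(herders):
        j = select_target(h, targets, goal_center, goal_radius)
        if j is not None:
            commands[i] = policy(h, targets[j], goal_center)
    return commands
```

In this sketch the `policy` argument is the only place where a trained network would be plugged in, which mirrors the idea that a driving policy trained in the 1H-1T setting can be reused unchanged by every herder in a larger scenario.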

Paper Structure

This paper contains 17 sections, 12 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Two-layer hierarchical feedback control scheme based on reinforcement learning, adapted from napolitano2025hierarchical. Each herder $H_{i,j}$ detects the positions of other agents and determines the target $T_{i,j}^*$ via a target-selection heuristic. The corresponding motion is then governed by the driving policy, which outputs the velocity command $\mathbf{u}$ of the herder.
  • Figure 2: Cumulative reward during PPO training in the 1H--1T setup with a single rectangular obstacle. The inset illustrates the initialization region $\Omega_0^* \subset \Omega_0$ behind the obstacle with respect to the goal region, where the target is placed with probability $p_{\text{obs}} = 0.5$ to promote obstacle-aware behavior. The final cumulative reward maintains a nearly constant value with limited fluctuations, suggesting that the agent has converged to a stable policy capable of effectively completing the task in most training scenarios.
  • Figure 3: Comparison between the vortex heuristic and the PPO-based strategy. Subfigures (a)–(b) show the gathering time and path length metrics; subfigures (c)–(d) illustrate the corresponding obstacle-avoidance trajectories for the vortex and PPO strategies, respectively.
  • Figure 4: Evolution of the mean and standard deviation of target (magenta) and herder (blue) distances from the goal center during a representative 10H--100T episode with three rectangular obstacles. All target radii eventually fall below the goal threshold $\rho_\mathrm{G} = 5$ (green dashed line), confirming effective gathering. Herders subsequently enter the goal region as containment is not modeled. The inset shows the initial configuration of targets, herders, obstacles, and the goal region.
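
The gathering criterion described in the Figure 4 caption can be sketched in a few lines of Python. This is a minimal illustration, assuming planar positions stored as NumPy arrays; the function names are hypothetical, and only the threshold value $\rho_\mathrm{G} = 5$ and the use of mean and standard deviation of distances from the goal center come from the caption.

```python
import numpy as np

def distance_stats(positions, goal_center):
    """Return (mean, std, radii) of agent distances from the goal center."""
    positions = np.asarray(positions, dtype=float)
    goal_center = np.asarray(goal_center, dtype=float)
    radii = np.linalg.norm(positions - goal_center, axis=1)
    return radii.mean(), radii.std(), radii

def all_gathered(target_positions, goal_center, rho_g=5.0):
    """True when every target radius is strictly below the goal threshold."""
    _, _, radii = distance_stats(target_positions, goal_center)
    return bool(np.all(radii < rho_g))
```

Evaluating `distance_stats` at every simulation step for targets and herders separately would produce the kind of curves the caption describes, and `all_gathered` marks the step at which gathering is complete.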