ARBoids: Adaptive Residual Reinforcement Learning With Boids Model for Cooperative Multi-USV Target Defense

Jiyue Tao, Tongsheng Shen, Dexin Zhao, Feitian Zhang

TL;DR

ARBoids addresses the challenging target defense problem for USVs in which the attacker is more agile than the defenders, combining a Boids-based baseline with a learnable residual policy. The method introduces a state-dependent adapter that blends a DRL policy with Boids, is trained via Soft Actor-Critic under centralized training with decentralized execution (CTDE), and uses curriculum learning to progressively confront stronger attackers. Empirical results in high-fidelity Gazebo simulations show that ARBoids outperforms pure Boids, residual-policy, and vanilla DRL baselines, with strong robustness to attacker agility and generalization to unseen team sizes. The approach demonstrates practical benefits for cooperative USV defense, offering improved interception success and scalable coordination, with potential for real-world deployment and extension to adversarial learning settings.

Abstract

The target defense problem (TDP) for unmanned surface vehicles (USVs) concerns intercepting an adversarial USV before it breaches a designated target region, using one or more defending USVs. A particularly challenging scenario arises when the attacker exhibits superior maneuverability compared to the defenders, significantly complicating effective interception. To tackle this challenge, this letter introduces ARBoids, a novel adaptive residual reinforcement learning framework that integrates deep reinforcement learning (DRL) with the biologically inspired, force-based Boids model. Within this framework, the Boids model serves as a computationally efficient baseline policy for multi-agent coordination, while DRL learns a residual policy to adaptively refine and optimize the defenders' actions. The proposed approach is validated in a high-fidelity Gazebo simulation environment, demonstrating superior performance over traditional interception strategies, including pure force-based approaches and vanilla DRL policies. Furthermore, the learned policy exhibits strong adaptability to attackers with diverse maneuverability profiles, highlighting its robustness and generalization capability. The code of ARBoids will be released upon acceptance of this letter.
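The adaptive blending idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the adapter's raw output lies in $[-1, 1]$ (for example after a tanh), that the linear map to $[0, 1]$ is $\theta = (x + 1)/2$, and that the final action is the convex combination $\theta\, a_\text{DRL} + (1 - \theta)\, a_\text{Boids}$. The function names `map_to_unit_interval` and `blend_actions` are hypothetical.

```python
from typing import List, Sequence


def map_to_unit_interval(raw: float) -> float:
    """Linearly map a raw adapter output to theta in [0, 1].

    Assumption: the raw output is nominally in [-1, 1]; values outside
    that range are clipped before the linear map.
    """
    clipped = max(-1.0, min(1.0, raw))
    return 0.5 * (clipped + 1.0)


def blend_actions(
    a_drl: Sequence[float], a_boids: Sequence[float], theta: float
) -> List[float]:
    """Blend the DRL action and the Boids action with weight theta.

    Assumption: theta weights the DRL action and (1 - theta) weights
    the Boids action; the source only states that theta adaptively
    weights the two actions.
    """
    if len(a_drl) != len(a_boids):
        raise ValueError("actions must have the same dimension")
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return [theta * d + (1.0 - theta) * b for d, b in zip(a_drl, a_boids)]


if __name__ == "__main__":
    theta = map_to_unit_interval(-0.5)  # leans toward the Boids baseline
    print(theta, blend_actions([1.0, 0.0], [0.0, 1.0], theta))
```

Under these assumptions, $\theta = 0$ recovers the pure Boids baseline and $\theta = 1$ the pure DRL action, so the adapter can fall back on the force-based behavior in states where the learned policy is less reliable.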

Paper Structure

This paper contains 25 sections, 14 equations, 7 figures.

Figures (7)

  • Figure 1: Schematic diagram illustrating the target defense problem involving multiple defenders and a more agile attacker, as addressed in this letter.
  • Figure 2: Architecture of the policy and critic networks for a single defender agent. The policy network comprises three modules: (i) State embedding, which processes the state components $s_{i,AT}$ and $s_{i,\text{Boids}}$ independently via fully-connected (Fc) layers, and encodes $s_{i,D}$ using mean observation embedding [huttenrauch2019deep]; (ii) Decision, which applies three Fc layers to the concatenated feature vector to produce the DRL action $a_\text{DRL}$; and (iii) Adapter, which integrates $a_\text{DRL}$, the Boids action $a_\text{Boids}$, and the decision-module hidden features. The adapter outputs a weighting coefficient $\theta$, linearly mapped to $[0,1]$, to adaptively weight $a_\text{DRL}$ and $a_\text{Boids}$. The critic network uses modules (i) and (ii), and includes an additional action-embedding layer to encode $a_\text{DRL}$ and $\theta$.
  • Figure 3: (a) State space components $s_{i,D}$ and $s_{i,AT}$ for each defender $i$, encompassing information about other defenders ($D$), the attacker ($A$), and the target region ($T$). All observations are expressed in local coordinates. (b) Diagram of components included in the formation reward $r_\text{form}$. Here, $\vec{d}_{iA}$ denotes the unit vector from defender $i$ to the attacker, $\vec{d}_{TA}$ is the unit vector from the target to the attacker, and $\vec{d}_{DA} = \sum_{i=1}^{n} \vec{d}_{iA}$ is the cumulative direction vector of all defenders.
  • Figure 4: Snapshots (left) and trajectories (right) from a representative Gazebo VRX experiment deploying the proposed ARBoids method. VRX is a high-fidelity marine simulator and the USV model follows the dynamics in \ref{eq:usv-motion}. The attacker, with an agility level of $\mathcal{L}_\text{agi}=2.25$, tries to penetrate the dock area. The timestamp $t$ is displayed in the upper-left corner of each snapshot. Red and blue arrows denote the heading directions of the attacker and defender, respectively.
  • Figure 5: Learning curves comparing our full method (ARBoids), ablations (w/o formation reward, FR; w/o curriculum learning, CL), and baselines (RP, SAC, Boids). Success rates are measured every 5,000 steps to assess learning progress. Shaded regions indicate the standard deviation across 5 independent runs. ARBoids substantially outperforms all baselines, and removing either FR or CL results in a pronounced degradation in performance.
  • ...and 2 more figures
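
The direction vectors defined in the Figure 3 caption can be computed as in the sketch below. It covers only $\vec{d}_{iA}$, $\vec{d}_{TA}$, and $\vec{d}_{DA} = \sum_{i=1}^{n} \vec{d}_{iA}$; how these enter the formation reward $r_\text{form}$ is not stated in this summary and is not reproduced here. The sketch assumes 2-D positions given in one common frame (the paper expresses observations in local coordinates), and the helper names `unit_vector` and `cumulative_direction` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec2 = Tuple[float, float]


def unit_vector(src: Vec2, dst: Vec2) -> Vec2:
    """Return the unit vector pointing from src to dst."""
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        raise ValueError("src and dst coincide; direction is undefined")
    return (dx / norm, dy / norm)


def cumulative_direction(defenders: Sequence[Vec2], attacker: Vec2) -> Vec2:
    """Sum the unit vectors d_iA from every defender i to the attacker."""
    sum_x, sum_y = 0.0, 0.0
    for position in defenders:
        ux, uy = unit_vector(position, attacker)
        sum_x += ux
        sum_y += uy
    return (sum_x, sum_y)


if __name__ == "__main__":
    defenders = [(0.0, 0.0), (3.0, 0.0)]
    attacker = (3.0, 4.0)
    target = (3.0, -1.0)
    print("d_DA =", cumulative_direction(defenders, attacker))
    print("d_TA =", unit_vector(target, attacker))
```

Note that $\vec{d}_{DA}$ is a sum of unit vectors and is not normalized itself: its magnitude shrinks toward zero when defenders face the attacker from opposing sides and approaches $n$ when they all approach from the same direction.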