
Critic-Free Deep Reinforcement Learning for Maritime Coverage Path Planning on Irregular Hexagonal Grids

Carlos S. Sepúlveda, Gonzalo A. Ruz

Abstract

Maritime surveillance missions, such as search and rescue and environmental monitoring, rely on the efficient allocation of sensing assets over vast and geometrically complex areas. Traditional Coverage Path Planning (CPP) approaches depend on decomposition techniques that struggle with irregular coastlines, islands, and exclusion zones, or require computationally expensive re-planning for every instance. We propose a Deep Reinforcement Learning (DRL) framework to solve CPP on hexagonal grid representations of irregular maritime areas. Unlike conventional methods, we formulate the problem as a neural combinatorial optimization task in which a Transformer-based pointer policy autoregressively constructs coverage tours. To overcome the instability of value estimation in long-horizon routing problems, we implement a critic-free Group-Relative Policy Optimization (GRPO) scheme. This method estimates advantages through within-instance comparisons of sampled trajectories rather than relying on a value function. Experiments on 1,000 unseen synthetic maritime environments demonstrate that a trained policy achieves a 99.0% Hamiltonian success rate, more than double that of the best heuristic (46.0%), while producing paths 7% shorter and with 24% fewer heading changes than the closest baseline. All three inference modes (greedy, stochastic sampling, and sampling with 2-opt refinement) run in under 50 ms per instance on a laptop GPU, confirming feasibility for real-time on-board deployment.

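The group-relative advantage estimation described in the abstract can be sketched as follows. This is an illustrative, simplified implementation under assumed conventions (rewards as negative tour lengths, per-group standardization); the paper's exact reward shaping and normalization may differ.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Critic-free advantage estimates: each trajectory sampled on the same
    instance is scored against the mean (and spread) of its own group, so
    no learned value function is required. Illustrative sketch only."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: negative tour lengths of 4 trajectories sampled on one instance.
# The shortest tour (reward -10.5) receives the largest advantage.
adv = group_relative_advantages([-12.0, -10.5, -11.2, -13.1])
```

Because advantages are centered within each group, they sum to (approximately) zero, which is what makes the scheme self-baselining.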

Paper Structure

This paper contains 51 sections, 12 equations, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: Taxonomy of surveillance-related coverage, patrolling, and routing problems underpinning our CPP formulation on hexagonal grids. The three columns mirror the structure of the Related work section, while Table \ref{tab:taxonomy} provides a concise summary of the corresponding problem families.
  • Figure 2: Hexagonal tessellation-to-graph pipeline for an irregular AOI with an internal obstacle. Hexagon size is set by the sensor footprint. (a) AOI, obstacle, and OBB. (b) OBB-aligned hex grid. (c) Visitable cells selected by intersection tests. (d) Graph edges defined by hex-neighborhood adjacency using cell centers (including the base node). (e) Final graph used for coverage planning. (The pipeline is illustrated with an explicit polygonal obstacle for visual clarity; in the experimental dataset, obstacles are implemented as individual cell removals---see Section \ref{subsec:dataset}.)
  • Figure 3: Proposed Transformer-based pointer policy for coverage path planning. The Graph Encoder pre-computes static node embeddings, while the Decoder dynamically aggregates the agent's spatial context and environmental signals to generate a query. The Pointer Network computes attention scores over valid nodes, strictly constrained by a feasibility mask to guarantee valid routing.
  • Figure 4: Training and validation success rate (greedy decoding) across epochs. The dashed line marks the selected checkpoint (epoch 30, val. SR = 95.5%). The shaded region indicates overfitting, where validation performance declines while training performance remains stable.
  • Figure 5: Failure mode analysis of the proposed RL-BoK16+2opt policy. In the rare instances (approx. 1.0%) where the stochastic sampling fails to secure a strict single-visit Hamiltonian path, the failures consistently correspond to severe geometric constraints. As illustrated, these include (a) dead-ends in narrow 1D corridors, (b) topological bisections that isolate clusters of unvisited nodes, and (c) self-occlusion when wrapping around obstacles. In these configurations, the agent commits to a sub-optimal branch that leaves remaining nodes unreachable without revisiting cells, thereby triggering the early dead-end penalty. These edge cases highlight the fundamental fragility of strictly Hamiltonian routing on highly irregular grids and strongly motivate future operational extensions that incorporate a bounded node-revisitation budget.
  • ...and 1 more figure
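The hex-neighborhood adjacency step of the tessellation-to-graph pipeline (Figure 2d) can be illustrated with a minimal sketch. This assumes axial hex coordinates and a set of visitable cells already filtered by intersection tests; the paper's pipeline works from cell centers on an OBB-aligned grid, so coordinates and helper names here are assumptions, not the authors' implementation.

```python
# Axial-coordinate offsets of the six neighbors of a pointy-top hex cell.
AXIAL_NEIGHBORS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]

def build_graph(cells):
    """Connect each visitable cell (q, r) to every hex neighbor that is
    also visitable, returning a set of undirected edges."""
    cells = set(cells)
    edges = set()
    for q, r in cells:
        for dq, dr in AXIAL_NEIGHBORS:
            nb = (q + dq, r + dr)
            if nb in cells:
                edges.add(frozenset(((q, r), nb)))
    return edges

# Three cells in a row yield two adjacency edges; a removed (obstacle)
# cell simply never contributes edges, exactly as in Figure 2(c)-(d).
edges = build_graph([(0, 0), (1, 0), (2, 0)])
```

The resulting edge set is the graph over which the coverage tour is planned.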