Platform-Agnostic Reinforcement Learning Framework for Safe Exploration of Cluttered Environments with Graph Attention

Gabriele Calzolari, Vidya Sumathy, Christoforos Kanellakis, George Nikolakopoulos

TL;DR

Problem: safe, efficient exploration in cluttered environments requires guarantees against collisions. Approach: a platform-agnostic hierarchical reinforcement learning framework combines a graph neural network policy for next-waypoint selection with a safety filter, trained using Proximal Policy Optimization (PPO) and augmented by a frontier- and potential-field-inspired reward. Contributions: (1) a GNN-based exploration policy with attention, (2) a safety filter that overrides infeasible actions with the closest feasible one, (3) a frontier-guided reward design that balances exploration gains and safety, and (4) validation in both simulation and real-world lab experiments demonstrating robust performance. Significance: demonstrates practical deployment potential of learning-based exploration on robotic platforms with explicit safety guarantees in cluttered spaces.

Abstract

Autonomous exploration of obstacle-rich spaces requires strategies that ensure efficiency while guaranteeing safety against collisions with obstacles. This paper investigates a novel platform-agnostic reinforcement learning framework that integrates a graph neural network-based policy for next-waypoint selection with a safety filter that ensures safe mobility. Specifically, the neural network is trained with the Proximal Policy Optimization (PPO) algorithm to maximize exploration efficiency while minimizing safety filter interventions. Whenever the policy proposes an infeasible action, the safety filter overrides it with the closest feasible alternative, ensuring consistent system behavior. In addition, this paper introduces a reward function shaped by a potential field that accounts for both the agent's proximity to unexplored regions and the expected information gain from reaching them. The proposed framework combines the adaptability of reinforcement learning-based exploration policies with the reliability provided by explicit safety mechanisms. This combination plays a key role in enabling the deployment of learning-based policies on robotic platforms operating in real-world environments. Extensive evaluations, both in simulation and in experiments performed in a lab environment, demonstrate that the approach achieves efficient and safe exploration in cluttered spaces.
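
One way to read the potential-field reward is as a standard potential-based shaping term, gamma * Phi(s') - Phi(s), added to an exploration bonus and a penalty for safety-filter interventions. The sketch below is a hypothetical illustration only: the potential Phi (best expected information gain per unit distance over all frontiers), the weights `w_explore` and `w_penalty`, and the discount `gamma` are assumptions, since the abstract does not state the paper's exact formula.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def potential(agent: Point, frontiers: Sequence[Point], gains: Sequence[float]) -> float:
    """Hypothetical potential: largest information gain per unit distance.

    It grows when a frontier with high expected gain is close to the agent.
    """
    if not frontiers:
        return 0.0  # no frontier left: nothing attracts the agent
    return max(g / (1.0 + math.dist(agent, f)) for f, g in zip(frontiers, gains))


def step_reward(phi_prev: float, phi_next: float, newly_explored: int,
                intervened: bool, gamma: float = 0.99,
                w_explore: float = 0.01, w_penalty: float = 1.0) -> float:
    """Hypothetical per-step reward.

    exploration bonus + potential-based shaping - safety-filter penalty
    """
    shaping = gamma * phi_next - phi_prev
    penalty = w_penalty if intervened else 0.0
    return w_explore * newly_explored + shaping - penalty


if __name__ == "__main__":
    phi0 = potential((0.0, 0.0), [(3.0, 4.0)], [6.0])  # 6 / (1 + 5) = 1.0
    phi1 = potential((3.0, 0.0), [(3.0, 4.0)], [6.0])  # 6 / (1 + 4) = 1.2
    print(step_reward(phi0, phi1, newly_explored=50, intervened=False))
```

The potential-based form rewards moving toward promising frontiers without changing which policies are optimal, while the explicit penalty encodes the objective of minimizing safety-filter interventions.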

Paper Structure

This paper contains 12 sections, 8 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of the proposed safe reinforcement learning framework for exploration in cluttered environments. The hierarchical architecture integrates a GNN-based policy with a safety filter to generate the next-step waypoint for exploration. The resulting high-level movement can be applied either to a simulated environment in Gymnasium for policy training and ablation studies (left branch), or to the laboratory setting with the Unitree Go1 quadruped robot for physical experiments (right branch).
  • Figure 2: Illustration of the exploration graph used as the observation $o_t$ and the GNN-based policy. The image shows a section of the agent's map, where occupied, unknown, and free cells are represented in black, gray, and white, respectively. The agent's position, feasible and infeasible next-step navigation goals, and frontiers are indicated in blue, green, red, and orange, respectively. The graph structure encodes node relationships, with edge thickness proportional to the distance between connected nodes. On the right, the local neighborhood of a frontier and its extracted feature vector are highlighted, together with the internal architecture of the exploration policy, where the main layers are emphasized.
  • Figure 3: Training performance curve showing the mean total reward versus training steps, with rewards normalized to $[0,1]$ and smoothed using a 1000-step simple moving average.
  • Figure 4: Data collected from the simulation of the trained policies on 100 randomly generated Gymnasium environments. From left to right: (a) Median map coverage per time step (colored lines) with variability shown as ±1 standard deviation (shaded regions) across testing environments. (b) Distribution of agents' map coverage after 1000 steps. (c) Distribution of the proportion of safety-filter interventions over agent actions across 100 test simulations per policy.
  • Figure 5: From left to right: median map coverage trajectories across 100 randomly generated exploration environments with different environment sizes (a) and number of trees (b) for the proposed policy SGA. Distribution of the proportion of safety filter interventions over agent actions across 100 test simulations for the policy SGA in environments with different sizes (c) and number of trees (d).
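
The policy in Figure 2 aggregates information over the exploration graph with attention. As a generic illustration, the following stdlib-only Python sketch implements a single-head graph attention layer in the style of GAT: each node's features are projected by a shared matrix `W`, neighbour scores are computed as LeakyReLU(a^T [z_i || z_j]), normalised with a softmax over the neighbourhood, and used to weight the projected neighbour features. The names, layer sizes, neighbourhood definition, and parameter values are assumptions; the paper's actual architecture may differ (for example, in the number of heads or the use of edge features).

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Sequence[float]) -> Vector:
    """Multiply matrix W (out_dim x in_dim) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def gat_layer(h: Sequence[Vector], neighbors: Dict[int, List[int]],
              W: Matrix, a: Vector) -> List[Vector]:
    """Single-head graph attention aggregation (GAT-style sketch).

    h:         node feature vectors (e.g. agent, frontier, waypoint nodes)
    neighbors: neighbors[i] lists the nodes that node i attends to
               (include i itself for a self-loop)
    W:         shared projection, shape out_dim x in_dim
    a:         attention vector of length 2 * out_dim
    """
    z = [matvec(W, hi) for hi in h]  # project every node
    out_dim = len(W)
    out: List[Vector] = []
    for i in range(len(h)):
        nbrs = neighbors[i]
        # Unnormalised scores e_ij = LeakyReLU(a^T [z_i || z_j]).
        scores = [leaky_relu(sum(ak * v for ak, v in zip(a, z[i] + z[j])))
                  for j in nbrs]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]  # softmax over the neighbourhood
        # Attention-weighted sum of the projected neighbour features.
        out.append([sum(alpha[k] * z[j][d] for k, j in enumerate(nbrs))
                    for d in range(out_dim)])
    return out
```

Setting `a` to all zeros makes every score equal, so each node simply averages its projected neighbours; a non-zero `a` lets the layer weight some neighbours, such as high-value frontiers, more heavily than others.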