Safe Policy Exploration Improvement via Subgoals

Brian Angulo; Gregory Gorbov; Aleksandr Panov; Konstantin Yakovlev

TL;DR

This work tackles safe, long-horizon navigation by formulating it as a Constrained Markov Decision Process and introducing SPEIS, a two-policy framework that jointly trains a Safe Policy (SAC-Lagrangian) and a Hierarchical Policy to generate intermediate subgoals. The safe policy learns to maximize reward while respecting cumulative safety constraints via an adaptive Lagrangian multiplier and dual critics for reward and safety, while the hierarchical policy predicts subgoals to guide exploration and improve long-horizon performance through KL-based regularization. Evaluations in Safety-Gym and POLAMP across multiple robot types show SPEIS achieves substantially lower collision rates (around 3% on average) and high success rates (above 80%), outperforming baselines including SAC, SAC-Lagrangian, and Safety-Layer–based methods, with faster inference than some planning-based approaches. The results demonstrate that integrating subgoal-driven exploration with safety-aware learning yields safer, more reliable navigation in challenging environments, with practical implications for real-world autonomous systems.

Abstract

Reinforcement learning is a widely used approach to autonomous navigation, showing potential in various tasks and robotic setups. Still, it often struggles to reach distant goals when safety constraints are imposed (e.g., a wheeled robot is prohibited from moving close to obstacles). One of the main reasons for poor performance in such setups, which are common in practice, is that the need to respect the safety constraints degrades the exploration capabilities of an RL agent. To address this, we introduce SPEIS (Safe Policy Exploration Improvement via Subgoals), a novel learnable algorithm that decomposes the initial problem into smaller sub-problems via intermediate goals while respecting the limit on the cumulative safety constraints. It comprises two coupled policies trained end-to-end: a subgoal policy and a safe policy. The subgoal policy is trained on transitions from the buffer of the safe (main) policy to generate subgoals that help the safe policy reach distant goals. Simultaneously, the safe policy maximizes its rewards while attempting not to violate the limit on the cumulative safety constraints, thus providing a certain level of safety. We evaluate SPEIS in a wide range of challenging simulated environments that involve different types of robots from two benchmarks: autonomous vehicles from the POLAMP environment, and the car, point, doggo, and sweep robots from the safety-gym environment. We demonstrate that our method consistently outperforms state-of-the-art competitors and can significantly reduce the collision rate while maintaining high success rates (higher by 80% compared to the best-performing methods).
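As a rough illustration of what "respecting the limit of the cumulative safety constraints" can look like in a Lagrangian method, the sketch below shows a generic projected dual-ascent update of a multiplier and the penalized score an actor would maximize. It is not the authors' implementation: the function names, the learning rate, and the numbers are hypothetical, and practical SAC-Lagrangian code typically estimates the cost from a learned safety critic rather than from a single scalar.

```python
def update_multiplier(lam, episode_cost, cost_limit, lr=0.01):
    """One projected dual-ascent step on the Lagrange multiplier.

    The multiplier grows when the observed cumulative cost exceeds the
    limit and shrinks otherwise; it is clipped at zero so the penalty
    never turns into a bonus. All names here are hypothetical.
    """
    return max(0.0, lam + lr * (episode_cost - cost_limit))


def penalized_score(q_reward, q_cost, lam):
    """Score the actor maximizes: reward value minus weighted cost value."""
    return q_reward - lam * q_cost


if __name__ == "__main__":
    lam = 1.0
    # Assumed numbers: cumulative cost 30 against a limit of 25.
    lam = update_multiplier(lam, episode_cost=30.0, cost_limit=25.0, lr=0.1)
    print("multiplier after a violating episode:", lam)
    print("penalized score:", penalized_score(10.0, 2.0, lam))
```

The design point is that the multiplier adapts automatically: repeated constraint violations make unsafe actions progressively more expensive, while a policy that stays under the limit sees the penalty decay toward zero.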

Paper Structure (29 sections, 14 equations, 8 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Safe Reinforcement Learning
Safety component to actor objective
Substitute action to safe action
Hierarchical Reinforcement Learning
Hierarchical and Safety Reinforcement Learning
Problem Statement
Navigation Problem
Constrained Markov Decision Process
Method
Safe Policy
SAC
SAC-Lagrangian
Hierarchical Policy
...and 14 more sections

Figures (8)

Figure 1: Safe exploration by utilizing the subgoals. Top: A safe policy often lacks sufficient exploration to effectively solve long-horizon tasks (i.e., when an agent needs to execute a long sequence of actions to reach the goal). Bottom: To mitigate this problem, we propose to generate subgoals inside the safe region to increase the exploration capabilities and, therefore, to increase the chance of successfully reaching the distant goal.
Figure 2: Transitions in the safety-gym environment and the policy update procedures.
Figure 3: Robots from the safety-gym environment: sweeping, point, car, doggo.
Figure 4: Left: the safety-gym environment, where the blue cylinders represent hazardous zones, the green cylinder is the goal, and the quadruped robot is shown in red. Right: the POLAMP environment, where the blue rectangles represent static obstacles, the orange arrays represent pseudo lidars, the cyan arrays represent the safe zone, the green rectangle with an arrow is the goal, and the blue rectangle with a red arrow is the vehicle.
Figure 5: Example dataset patterns. The left figure illustrates one of the patterns from level 1 and the right figure illustrates one of the patterns from level 2, both of which were used for training and validation.
...and 3 more figures
