Enhance Exploration in Safe Reinforcement Learning with Contrastive Representation Learning

Duc Kien Doan, Bang Giang Le, Viet Cuong Ta

TL;DR

The paper addresses safe reinforcement learning under sparse rewards by decoupling state and action safety and learning a safety-aware latent representation. It introduces a two-phase framework that combines transferable domain priors for action safety with a contrastive auto-encoder that maps observations into a latent space where safe and unsafe states are separable; a latent-distance state safety check biases exploration away from unsafe regions while preserving exploration in safe areas. Key contributions include a formal safety-aware latent representation learned via contrastive learning, an online unsafe-embedding buffer with a distance-based safety criterion, and demonstrated improvements in exploration efficiency and safety balance on three LavaCrossing MiniGrid tasks. The approach is practically significant for enabling more efficient learning in safety-critical, high-dimensional observation spaces with sparse rewards, and it opens avenues for adaptive latent-distance mechanisms and alternative unsupervised losses.

Abstract

In safe reinforcement learning, an agent needs to balance exploratory actions against safety constraints. Following this paradigm, domain transfer approaches learn a prior Q-function from related environments to prevent unsafe actions. However, because of the large number of false positives (safe actions flagged as unsafe), some safe actions are never executed, leading to inadequate exploration in sparse-reward environments. In this work, we aim to learn an efficient state representation that balances exploration and safety-preferring actions in a sparse-reward environment. First, the image input is mapped to a latent representation by an auto-encoder. A contrastive learning objective is then employed to distinguish safe and unsafe states. In the learning phase, the latent distance is used to construct an additional safety check, which allows the agent to bias its exploration when it visits an unsafe state. To verify the effectiveness of our method, experiments are carried out in three navigation-based MiniGrid environments. The results show that our method explores the environment better while maintaining a good balance between safety and efficiency.
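
To make the idea of a contrastive objective that separates safe and unsafe states concrete, here is a minimal sketch. It uses a generic pairwise margin-based contrastive loss, which is a stand-in and not necessarily the loss used in the paper. The function name `pairwise_contrastive_loss`, the `margin` parameter, and the Euclidean metric are all assumptions made for illustration.

```python
import numpy as np

def pairwise_contrastive_loss(z_a, z_b, same_label, margin=1.0):
    """Generic margin-based contrastive loss over pairs of embeddings.

    z_a, z_b   : arrays of shape (n_pairs, dim), the two embeddings of each pair
    same_label : array of shape (n_pairs,), 1 if both states share a safety
                 label (both safe or both unsafe), 0 otherwise
    margin     : hypothetical hyperparameter; pairs with different labels are
                 pushed apart until they are at least this far from each other
    """
    same_label = np.asarray(same_label, dtype=float)
    # Euclidean distance between the two embeddings of every pair.
    d = np.linalg.norm(np.asarray(z_a) - np.asarray(z_b), axis=1)
    # Pairs with the same label are pulled together.
    pull = same_label * d ** 2
    # Pairs with different labels are penalized only inside the margin.
    push = (1.0 - same_label) * np.maximum(0.0, margin - d) ** 2
    return float(np.mean(pull + push))
```

In a full pipeline this term would be added to the auto-encoder's reconstruction loss; how the two terms are weighted is not stated in this summary.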

Paper Structure

This paper contains 12 sections, 9 equations, 8 figures, 1 algorithm.

Figures (8)

  • Figure 1: Illustration of training a prior Q-function to identify unsafe actions. Given learned Q-functions of $m$ tasks in the same domain, weights are computed and assigned to each state-action pair. These pairs are then selected and used to construct a prior Q-function $Q^*_p$.
  • Figure 2: Learning a latent representation from past observations with contrastive learning.
  • Figure 3: Illustration of our method. Given an observation $o_t$ and proposed action $a$, the agent performs a state safety check based on the average distance from the current observation embedding and embeddings in the unsafe buffer $\mathcal{B}$. If the current state is safe, the proposed action is executed. Otherwise, the agent executes a safe action computed by the action safety check module.
  • Figure 4: MiniGrid environments.
  • Figure 5: The episodic returns for 5 agents on 3 MiniGrid environments, averaged over 3 seeds.
  • ...and 3 more figures
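
The decision rule described in the Figure 3 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the Euclidean metric, the `threshold` parameter, the function names, the `safe_action_fn` callback standing in for the action safety check module, and the choice to treat an empty buffer as safe are all hypothetical.

```python
import numpy as np

def is_state_safe(z, unsafe_buffer, threshold):
    """State safety check: compare the average latent distance to a threshold.

    z             : embedding of the current observation, shape (dim,)
    unsafe_buffer : sequence of embeddings of previously seen unsafe states
    threshold     : hypothetical cutoff; a larger average distance means safer
    """
    if len(unsafe_buffer) == 0:
        # Assumption: with no unsafe examples recorded yet, treat the state as safe.
        return True
    dists = np.linalg.norm(np.asarray(unsafe_buffer) - np.asarray(z), axis=1)
    return float(dists.mean()) > threshold

def select_action(z, proposed_action, safe_action_fn, unsafe_buffer, threshold):
    """Execute the proposed action in safe states, else fall back to a safe action."""
    if is_state_safe(z, unsafe_buffer, threshold):
        return proposed_action
    # safe_action_fn is a placeholder for the action safety check module.
    return safe_action_fn()
```

For example, with two unsafe embeddings that are each at distance 5 from the current embedding, a threshold of 1 lets the proposed action through, while a threshold of 10 triggers the fallback.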

Theorems & Definitions (1)

  • Definition 1