Probabilistic Satisfaction of Temporal Logic Constraints in Reinforcement Learning via Adaptive Policy-Switching

Xiaoshan Lin, Sadık Bera Yüksel, Yasin Yazıcıoğlu, Derya Aksaray

TL;DR

This paper addresses a constrained reinforcement learning (CRL) problem in which an agent learns a reward-maximizing policy while maintaining a desired probability of satisfying a temporal logic constraint throughout the learning process. It proposes a framework that switches between pure learning and constraint satisfaction.

Abstract

Constrained Reinforcement Learning (CRL) is a subset of machine learning that introduces constraints into the traditional reinforcement learning (RL) framework. Unlike conventional RL, which aims solely to maximize cumulative rewards, CRL incorporates additional constraints that represent specific mission requirements or limitations with which the agent must comply during the learning process. In this paper, we address a type of CRL problem where an agent aims to learn the optimal policy to maximize reward while ensuring a desired level of temporal logic constraint satisfaction throughout the learning process. We propose a novel framework that relies on switching between pure learning (reward maximization) and constraint satisfaction. This framework estimates the probability of constraint satisfaction based on earlier trials and adjusts the probability of switching between the learning and constraint satisfaction policies accordingly. We theoretically validate the correctness of the proposed algorithm and demonstrate its performance through comprehensive simulations.
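The switching idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it is a simplified, episode-level illustration, and the names `SatisfactionEstimator`, `switch_probability`, `choose_policy`, and `p_go_lower` are hypothetical. It assumes that a lower bound `p_go_lower` on the satisfaction probability of the constraint-satisfaction policy is available, and that the learning policy's satisfaction probability is estimated as an empirical frequency over earlier trials.

```python
import random


class SatisfactionEstimator:
    """Empirical estimate of how often the learning policy satisfied the constraint."""

    def __init__(self):
        self.trials = 0
        self.successes = 0

    def update(self, satisfied):
        # Record the outcome of one earlier trial run under the learning policy.
        self.trials += 1
        self.successes += int(satisfied)

    def estimate(self):
        # Pessimistic default of 0.0 before any trial is observed (an assumption).
        return self.successes / self.trials if self.trials else 0.0


def switch_probability(pr_des, p_learn_est, p_go_lower):
    """Smallest q in [0, 1] with q * p_go_lower + (1 - q) * p_learn_est >= pr_des.

    q is the probability of running the constraint-satisfaction policy.
    If no q in [0, 1] meets the target, q is clipped to 1.0.
    """
    if p_learn_est >= pr_des:
        return 0.0  # learning alone is estimated to meet the target
    if p_go_lower <= p_learn_est:
        return 1.0  # mixing cannot help; always use the constraint policy
    q = (pr_des - p_learn_est) / (p_go_lower - p_learn_est)
    return min(1.0, max(0.0, q))


def choose_policy(pr_des, estimator, p_go_lower, rng):
    """Randomly pick which policy to run, using the current estimate."""
    q = switch_probability(pr_des, estimator.estimate(), p_go_lower)
    return "constraint" if rng.random() < q else "learning"
```

Under these assumptions, the agent runs the learning policy more often as its estimated satisfaction probability approaches the desired probability, and falls back to the constraint-satisfaction policy when the estimate is low.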

Paper Structure

This paper contains 8 sections, 5 theorems, 25 equations, 6 figures, 4 tables.

Key Result

Theorem 1

Let Assumption 1 hold. For any $p \in S_{P}$ of the given product MDP $\mathcal{P} = (S_{P},P_{init},A,\Delta_{P}, R_{P}, F_{P})$, let integer $k > 0$ denote the remaining time steps, $d= D^\epsilon(p)$ denote the distance-to-$F_{P}$ from $p$, and $Pr(p \xrightarrow{k} F_{P}; \pi_{GO}^{\epsilon})$ […] where […]

Figures (6)

  • Figure 1: An MDP where $S=\{s_0,s_1, s_2\}$, $A=\{Move,Stay\}$, $AP=\{Home,Store, Charging\,\,Station\}$, $l(s_0)=\{Store\}$, $l(s_1) =\{Home\}$, $l(s_2)=\{Charging\,\,Station\}$. Edge labels indicate the corresponding action and transition probability.
  • Figure 2: Possible transitions from state $p$ under $\pi^\epsilon_{GO}$, where $\Delta_i$ denotes the unknown transition probabilities.
  • Figure 3: Transitions (intended - blue, unintended - yellow) under each action.
  • Figure 4: An environment where yellow, green, blue, and black cells are respectively the base station, the pick-up region, the delivery regions, and the obstacles. The gray cells are reward regions (darker shades - higher reward). The arrows denote illustrative trajectories that are obtained by applying policies learned by: (a) the method of Aksaray et al. (2021), (b) the proposed algorithm (blue: reward maximization policy $\pi$, black: $\pi_{GO}^{\epsilon}$ policy).
  • Figure 5: Reward and constraint satisfaction rate under various desired probabilities $Pr_{des}$.
  • ...and 1 more figure

Theorems & Definitions (17)

  • Definition 1
  • Definition 2: Deterministic Policy
  • Definition 3: Product MDP
  • Definition 4: $\epsilon$-Stochastic Transitions
  • Definition 5: Distance-To-$F_{P}$
  • Definition 6: $\pi_{GO}^{\epsilon}$ Policy
  • Theorem 1
  • proof
  • Corollary 1.1
  • proof
  • ...and 7 more