Sample-Efficient Reinforcement Learning with Temporal Logic Objectives: Leveraging the Task Specification to Guide Exploration

Yiannis Kantaros, Jun Wang

TL;DR

The paper proposes an accelerated RL algorithm that learns control policies significantly faster than competitive approaches. It relies on a novel task-driven exploration strategy that biases exploration toward directions that may contribute to task satisfaction.

Abstract

This paper addresses the problem of learning optimal control policies for systems with uncertain dynamics and high-level control objectives specified as Linear Temporal Logic (LTL) formulas. Uncertainty is considered in the workspace structure and the outcomes of control decisions, giving rise to an unknown Markov Decision Process (MDP). Existing reinforcement learning (RL) algorithms for LTL tasks typically rely on exploring a product MDP state-space uniformly (e.g., using an $\epsilon$-greedy policy), compromising sample-efficiency. This issue becomes more pronounced as the rewards get sparser and the MDP size or the task complexity increases. In this paper, we propose an accelerated RL algorithm that can learn control policies significantly faster than competitive approaches. Its sample-efficiency relies on a novel task-driven exploration strategy that biases exploration towards directions that may contribute to task satisfaction. We provide theoretical analysis and extensive comparative experiments demonstrating the sample-efficiency of the proposed method. The benefit of our method becomes more evident as the task complexity or the MDP size increases.
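As a rough illustration of how exploration can be biased, the following is a minimal sketch of an $(\epsilon,\delta)$-greedy action selection, the policy class named in Proposition 4.1 and Figure 4. It assumes that $\epsilon=\delta_b+\delta_e$, that the greedy action is taken with probability $1-\epsilon$, a task-driven "biased" action with probability $\delta_b$, and a uniformly random action with probability $\delta_e$. The function name `eps_delta_greedy` and the argument `biased_action` are hypothetical; how the biased action is computed is not shown here.

```python
import random


def eps_delta_greedy(q_values, biased_action, delta_b, delta_e, rng=random):
    """Pick one action for a single state (illustrative sketch only).

    q_values      : dict mapping action -> current Q estimate
    biased_action : action suggested by the task-driven bias (assumed given)
    delta_b       : probability of taking the biased action
    delta_e       : probability of taking a uniformly random action
    The greedy action is taken with probability 1 - (delta_b + delta_e).
    """
    if delta_b < 0 or delta_e < 0 or delta_b + delta_e > 1:
        raise ValueError("need delta_b, delta_e >= 0 and delta_b + delta_e <= 1")

    u = rng.random()  # uniform sample in [0, 1)
    if u < delta_b:
        return biased_action  # biased exploration
    if u < delta_b + delta_e:
        return rng.choice(list(q_values))  # random exploration
    return max(q_values, key=q_values.get)  # exploitation
```

Setting `delta_b = 0` reduces this sketch to a plain $\epsilon$-greedy rule, which matches the remark in the Figure 4 caption that $\delta_b=0$ results in random exploration.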

Paper Structure

This paper contains 31 sections, 6 theorems, 36 equations, 8 figures, 1 algorithm.

Key Result

Proposition 4.1

For any $(\epsilon,\delta)$-greedy policy $\boldsymbol{\mu}$, the updated $(\epsilon,\delta)$-greedy policy $\boldsymbol{\mu'}$ obtained after updating the state-action value function $Q^{\boldsymbol{\mu}}(s,a)$ satisfies $U^{\boldsymbol{\mu'}}(s)\geq U^{\boldsymbol{\mu}}(s)$, for all $s\in{\mathcal

Figures (8)

  • Figure 1: DRA corresponding to $\phi=\Diamond(\pi^{\text{Exit1}}\vee\pi^{\text{Exit2}})$. There is only one set of accepting pairs defined as ${\mathcal{G}}_1=\{q_D^F\}$ and ${\mathcal{B}}_1=\{q_D^0\}$. A transition is enabled if the robot generates a symbol satisfying the Boolean formula noted on top of the transitions. All transitions are feasible as per Def. \ref{def:feas}. The function $d_F$ in \ref{eq:dist2G} is defined as $d_F(q_D^0,{\mathcal{F}})=1$ and $d_F(q_D^F,{\mathcal{F}})=0$.
  • Figure 2: Graphical depiction of the sets ${\mathcal{X}}_{\text{goal}}(q_{t})$. The disks represent MDP states and the arrows between states mean that there exists at least one action such that the transition probability from one state to another one is non-zero. The length of the shortest path from $x_t$ to ${\mathcal{X}}_{\text{goal}}$ is $3$ hops, i.e., $J_{x_t,{\mathcal{X}}_{\text{goal}}}=3$; see \ref{eq:dist2set}. Also, the paths $p_j^t$, $j\in{\mathcal{J}}=\{1,2\}$ are highlighted with thick green lines. The numbers on top of the green edges represent $\max_{a}P(p_j^t(e),a,p_j^t(e+1))$; see \ref{eq:optUnc}. Observe that $p^*$ is the green path highlighted with gray color.
  • Figure 3: MDP-based representation of the interaction of a ground robot with a corridor-like environment. The square cells represent MDP states, i.e., ${\mathcal{X}}=\{\text{Exit1},\text{Exit2}, \text{A}, \text{B},\text{C}, \text{D}, \text{E}\}$. An action enabling transition between adjacent cells with non-zero probability exists for all MDP states.
  • Figure 4: Decay rates of the parameters $\delta_e$, $\delta_b$, and $\epsilon$ considered in Section \ref{sec:Sim} for $\mathfrak{M}_1$ and $\mathfrak{M}_2$. The rate at which $1-\epsilon$ (red) increases is the same in all figures. As the number of episodes goes to infinity, $1-\epsilon$ converges to $1$ and both $\delta_b$ and $\delta_e$ converge to $0$. Notice that, in the bottom right figure, $\delta_b$ is always equal to $0$ resulting in random exploration ($\epsilon$-greedy policy).
  • Figure 5: A Simple Coverage Task (Section \ref{sec:coverage}): Comparison of average satisfaction probability $\bar{\mathbb{P}}$ when Algorithm \ref{alg:RL-LTL} is applied with the proposed $(\epsilon,\delta)$-greedy policy, $\epsilon$-greedy policy, Boltzmann policy, and UCB1 policy over the MDPs $\mathfrak{M}_1$, $\mathfrak{M}_2$, and $\mathfrak{M}_3$. $\text{Biased 1}-30$ and $\text{Biased 1}-100$ refer to the cases where the Biased 1 exploration method is applied under the constraint that the MDP transition probabilities are updated only during the first $30$ and $100$ episodes, respectively. The legend also includes the total runtime per method. The black stars on top of each reward curve mark the training episode that the corresponding policy has reached at the time when the fastest policy has finished training over the total number of episodes.
  • ...and 3 more figures
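
The hop-count distance and edge weights described in the Figure 2 caption can be sketched as follows. This is an illustrative example only: the graph, its state names, and its weights are made up (they are not the graph of Figure 2), each weight stands for $\max_a P(x,a,x')$ on that edge, and it is an assumption here that $p^*$ is the shortest path whose product of weights is largest. The function names are hypothetical.

```python
from collections import deque
from math import prod


def hop_distances(edges, source):
    """Breadth-first search: minimum number of hops from source to each state."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in edges.get(x, {}):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def shortest_paths(edges, source, goals):
    """Return (J, paths): the hop distance to the goal set and all paths of that length."""
    dist = hop_distances(edges, source)
    reachable = [g for g in goals if g in dist]
    if not reachable:
        return None, []
    J = min(dist[g] for g in reachable)
    paths = []

    def extend(path):
        x = path[-1]
        if len(path) - 1 == J:
            if x in goals:
                paths.append(path)
            return
        for y in edges.get(x, {}):
            if dist.get(y) == dist[x] + 1:  # stay on shortest-path layers
                extend(path + [y])

    extend([source])
    return J, paths


def most_likely_path(edges, paths):
    """Among the given paths, pick the one maximizing the product of edge weights."""
    return max(paths, key=lambda p: prod(edges[a][b] for a, b in zip(p, p[1:])))


# Hypothetical graph: edges[x][y] plays the role of max_a P(x, a, y).
edges = {
    "x_t": {"a": 0.9, "b": 0.8},
    "a": {"c": 0.7},
    "b": {"d": 0.9},
    "c": {"g1": 0.6},
    "d": {"g2": 0.9},
    "g1": {},
    "g2": {},
}
goals = {"g1", "g2"}

J, paths = shortest_paths(edges, "x_t", goals)
best = most_likely_path(edges, paths)
```

In this made-up graph the goal set is three hops away and there are two shortest paths; the path through `b` and `d` has the larger weight product (0.648 versus 0.378), so it is the one selected.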

Theorems & Definitions (23)

  • Definition 2.1: MDP
  • Definition 2.4: DRA \cite{baier2008principles}
  • Example 2.5: DRA
  • Definition 2.6: PMDP
  • Definition 3.1: Feasible symbols $\sigma\in\Sigma$
  • Definition 3.2: Feasibility of DRA transitions
  • Remark 3.3: Initialization
  • Remark 3.4: Computing Shortest Path
  • Remark 3.5: Weights & Shortest Paths
  • Example 3.6: Biased Exploration
  • ...and 13 more