Composing Reinforcement Learning Policies, with Formal Guarantees

Florent Delgrange; Guy Avni; Anna Lukina; Christian Schilling; Ann Nowé; Guillermo A. Pérez

Composing Reinforcement Learning Policies, with Formal Guarantees

Florent Delgrange, Guy Avni, Anna Lukina, Christian Schilling, Ann Nowé, Guillermo A. Pérez

TL;DR

This work addresses long-horizon control in environments structured as a map of rooms, each with unknown low-level dynamics. It couples high-level reactive synthesis on the map with low-level RL in individual rooms, using latent models learned via WAE-DQN that come with PAC guarantees on abstraction quality. The central results establish value- and planner-guarantee lifting from room-level abstractions to the two-level controller, including a memory bound of size |old V| for the planner and bounds that depend on the transition loss L_P and entrance distribution coverage. A succinct MDP representation enables efficient planning at the high level, without inflating the state space with room details. Empirically, grid-world and ViZDoom case studies demonstrate that the two-level controller achieves high success probabilities while retaining formal guarantees, even when DRL alone struggles with sparse rewards and high-dimensional observations.

Abstract

We propose a novel framework to controller design in environments with a two-level structure: a known high-level graph ("map") in which each vertex is populated by a Markov decision process, called a "room". The framework "separates concerns" by using different design techniques for low- and high-level tasks. We apply reactive synthesis for high-level tasks: given a specification as a logical formula over the high-level graph and a collection of low-level policies obtained together with "concise" latent structures, we construct a "planner" that selects which low-level policy to apply in each room. We develop a reinforcement learning procedure to train low-level policies on latent structures, which unlike previous approaches, circumvents a model distillation step. We pair the policy with probably approximately correct guarantees on its performance and on the abstraction quality, and lift these guarantees to the high-level task. These formal guarantees are the main advantage of the framework. Other advantages include scalability (rooms are large and their dynamics are unknown) and reusability of low-level policies. We demonstrate feasibility in challenging case studies where an agent navigates environments with moving obstacles and visual inputs.

Composing Reinforcement Learning Policies, with Formal Guarantees

TL;DR

Abstract

Paper Structure (55 sections, 12 theorems, 18 equations, 13 figures, 3 tables, 1 algorithm)

This paper contains 55 sections, 12 theorems, 18 equations, 13 figures, 3 tables, 1 algorithm.

Introduction
Framework (Fig. \ref{['fig:framework']})
Advantages.
Case study.
Contributions.
Preliminaries
Markov Decision Processes (MDPs).
Limiting behaviors in MDPs
Objectives and values.
Problem Formulation
Two-level model.
High-level reach and low-level reach-avoid objectives.
High-level control.
Obtaining Low-Level RL policies
Quantifying the quality of the abstraction
...and 40 more sections

Key Result

Lemma 1

Let $\mkern 1.5mu\overline{\mkern-1.5mu\pi\mkern-1.5mu}\mkern 1.5mu$ be a latent policy and $\xi_{\mkern 1.5mu\overline{\mkern-1.5mu\pi\mkern-1.5mu}\mkern 1.5mu}$ be the unique stationary measure of $\mathcal{M}$, then the average value difference is bounded by $L_{\mathbf{P}}$: $\mathbb{E}_{s \sim

Figures (13)

Figure 1: (a) Environment in which the agent (top-right) needs to reach the target (green, bottom-left) while avoiding moving adversaries (red). The target appears in the map as a dedicated vertex $t$. (b) The agent is trained to exit each room, in every possible direction. Training is performed in parallel simulations. An abstraction of the environment is learned via neural networks, yielding a latent model for each room. Simultaneously, a policy is learned via RL on the learned latent representation, which guarantees the agent's low-level behavior conformity through PAC bounds. More details in Sect. \ref{['sec:low_level']}. (c) Given a high-level description of the environment, a collection of latent models and policies for each room, and the specifications, synthesis outputs a high-level planner guaranteed to satisfy the specifications. The challenge resides in the way the low-level components are merged to apply synthesis while maintaining their guarantees. More details in Sect. \ref{['sec:high_level']}.
Figure 2: (a) Top: The high-level graph $\color{graphcolor}{\mathcal{G}}$ with two rooms $\color{mdpcolor}{R_0} = \ell(v_0)$ and $\color{mdpcolor}{R_1} = \ell(u)$. Middle: Part of the explicit MDP for the bottom layer; e.g., the MDP $\color{mdpcolor}{R_0}$ contains $16$ states. Traversing the edge $\langle \langle s_2,v_0 \rangle, \langle s_0, u \rangle \rangle$ corresponds to exiting $\color{mdpcolor}{R_0}$ and entering $\color{mdpcolor}{R_1}$ from direction $d_1 = \langle v_0, u \rangle$. The goal of is to reach $u'$ by exiting the room $\color{mdpcolor}{R_1}$ from direction $d_2 = \langle u, u' \rangle$ while avoiding the moving adversaries . For $i \in \left\{ 0,1 \right\}$, the entrance function $\color{mdpcolor}{\mathcal{I}_{R_i}}$ models the distribution from which the initial location of in $R_i$ is drawn. (b) A room with four policies for a planner to choose from; e.g., $\pi_\rightarrow({\cdotp} \mid s_1)$ leads to $B_{\ell(u)}$ and $\pi_\uparrow({\cdotp} \mid s_1)$ leads to $s_3$. Note that, while these are deterministic policies, in general, the policies in rooms are probabilistic.
Figure 3: Small room in a grid world.
Figure 4: To run $\mkern 1.5mu\overline{\mkern-1.5mu\pi\mkern-1.5mu}\mkern 1.5mu$ in the original environment $\mathcal{M}$, (i) map $s$ to $\phi\mathopen{}\mathclose{\left(s\right)} = \bar{s}$, (ii) draw $a \sim \mkern 1.5mu\overline{\mkern-1.5mu\pi\mkern-1.5mu}\mkern 1.5mu\mathopen{}\mathclose{\left({\cdotp} \mid \bar{s}\right)}$. $L_{\mathbf{P}}$ measures the gap (in red) between latent states produced via $\bar{s}_1 = \phi\mathopen{}\mathclose{\left(s'\right)}$ with $s' \sim \mathbf{P}\mathopen{}\mathclose{\left({\cdotp} \mid s, a\right)}$ (shortened as $\bar{s}_1 \sim \phi\mathbf{P}\mathopen{}\mathclose{\left({\cdotp} \mid s, a\right)}$) and those produced directly in the latent space: $\bar{s}_2\sim \mkern 1.5mu\overline{\mkern-1.5mu\mathbf{P}\mkern-1.5mu}\mkern 1.5mu\mathopen{}\mathclose{\left({\cdotp} \mid \bar{s}, a\right)}$.
Figure 5: Chain of reductions for synthesizing a planner $\tau$ in a two-level model $\mathcal{H}$. $\mathcal{H}$ can be formulated as an explicit MDP $\mathcal{M}$. Once the low-level policies $\mathrm{\Pi}$ are learned (Fig. \ref{['fig:intro-1']}), the synthesis problem reduces to constructing a stationary policy in an MDP plan$\mathcal{M}_{\mathrm{\Pi}}$ where $\mathrm{\Pi}$ is fixed and the state space of $\mathcal{M}_{\mathrm{\Pi}}$ encodes the directions chosen in each room. From this policy, one can derive a $\left| \mathcal{V} \right|$-memory planner $\tau$ for $\mathcal{H}$ (Thm. \ref{['thm:mdp-planner-equiv']}). Finally, finding a policy in $\mathcal{M}_{\mathrm{\Pi}}$ is equivalent to finding a policy in a succinct model$\mathcal{M}^{\mathcal{G}}_{\mathrm{\Pi}}$ where (i) the state space corresponds to the directions from which rooms are entered, (ii) the actions to the choices of the planner, and (iii) the transition probabilities to the values achieved by the latent policy chosen (Thm. \ref{['thm:succint-mdp-plan']}).
...and 8 more figures

Theorems & Definitions (15)

Example 1: Room
Lemma 1: DBLP:conf/aaai/DelgrangeN022
Theorem 1
Lemma 2
Theorem 2: The value bounds are PAC learnable
Example 2: Planners require memory
Theorem 3
Example 3
Theorem 4
Theorem 5
...and 5 more

Composing Reinforcement Learning Policies, with Formal Guarantees

TL;DR

Abstract

Composing Reinforcement Learning Policies, with Formal Guarantees

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (13)

Theorems & Definitions (15)