Programmatic Reinforcement Learning: Navigating Gridworlds

Guruprerana Shabadi; Nathanaël Fijalkow; Théo Matricon

Programmatic Reinforcement Learning: Navigating Gridworlds

Guruprerana Shabadi, Nathanaël Fijalkow, Théo Matricon

TL;DR

This work initiates a theoretical study of programmatic reinforcement learning by modeling policies as executable programs in gridworld environments. It introduces a expressive subgoal DSL and develops a backward tree algorithm that characterizes the set of states from which the target is reachable, proving the existence of optimal programmatic policies and giving upper bounds on their size. A constructive policy synthesis procedure derives compact programmatic policies from shortest-path trees, with a polynomial-size bound of $O(|\text{Regions}|^4)$ and a storage-space analysis under discretized endpoints; a Python prototype accompanies the theory. The results support the interpretability, verifiability, and potential generalization of programmatic policies, and outline future directions in verification, learning, and generalization beyond the gridworld setting.

Abstract

The field of reinforcement learning (RL) is concerned with algorithms for learning optimal policies in unknown stochastic environments. Programmatic RL studies representations of policies as programs, meaning involving higher order constructs such as control loops. Despite attracting a lot of attention at the intersection of the machine learning and formal methods communities, very little is known on the theoretical front about programmatic RL: what are good classes of programmatic policies? How large are optimal programmatic policies? How can we learn them? The goal of this paper is to give first answers to these questions, initiating a theoretical study of programmatic RL. Considering a class of gridworld environments, we define a class of programmatic policies. Our main contributions are to place upper bounds on the size of optimal programmatic policies, and to construct an algorithm for synthesizing them. These theoretical findings are complemented by a prototype implementation of the algorithm.

Programmatic Reinforcement Learning: Navigating Gridworlds

TL;DR

and a storage-space analysis under discretized endpoints; a Python prototype accompanies the theory. The results support the interpretability, verifiability, and potential generalization of programmatic policies, and outline future directions in verification, learning, and generalization beyond the gridworld setting.

Abstract

Paper Structure (16 sections, 6 theorems, 14 equations, 9 figures, 1 table, 1 algorithm)

This paper contains 16 sections, 6 theorems, 14 equations, 9 figures, 1 table, 1 algorithm.

Introduction
Gridworlds
The subgoal DSL
Inspiration
Programmatic subgoal policies
The tree of shortest paths
The backward algorithm
Properties of the tree
Constructing programmatic policies
Conclusions and future work
Illustration of the tree
Proof of the third item of Lemma \ref{['lemma:edges-two-ortho-dirs']}
Correctness of the compression algorithm: Proof of Theorem \ref{['theorem:synthesis-correctness']}
Analysis of the size of programs: Proof of Theorem \ref{['theorem:synthesis-size']}
Analysis of the bitsize of programs: Proof of Lemma \ref{['lemma:size-segment']}
...and 1 more sections

Key Result

Lemma 1

If $v_1$ and $v_2$ belong to the same region and there exists a path from $v_1$ to $v_2$ inside this region, then there exists a move between $v_1$ and $v_2$.

Figures (9)

Figure 1: Spiral
Figure 2: Double pass triangle
Figure 3: Images published in Life magazine in 1952 which show the path taken by Claude Shannon's mechanical mouse first while learning to navigate the maze and the direct path taken in its second attempt.
Figure 4: Backward construction of the tree of the winning region
Figure 5: Any path which visits $[a_{k+1}, b_{k+1}] \to$$[a_i, b_i]\to$$[a_{i+1}, b_{i+1}]\to$$[a_j, b_j] \to$$[a_{j+1}, b_{j+1}]$ is necessarily self-intersecting.
...and 4 more figures

Theorems & Definitions (15)

Example 1: Spiral
Lemma 1
Example 2: Double pass triangle
Example 3
Example 4
Lemma 2
proof
Lemma 3
proof
Theorem 1
...and 5 more

Programmatic Reinforcement Learning: Navigating Gridworlds

TL;DR

Abstract

Programmatic Reinforcement Learning: Navigating Gridworlds

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (9)

Theorems & Definitions (15)