Table of Contents
Fetching ...

Programmatic Reinforcement Learning: Navigating Gridworlds

Guruprerana Shabadi, Nathanaël Fijalkow, Théo Matricon

TL;DR

This work initiates a theoretical study of programmatic reinforcement learning by modeling policies as executable programs in gridworld environments. It introduces a expressive subgoal DSL and develops a backward tree algorithm that characterizes the set of states from which the target is reachable, proving the existence of optimal programmatic policies and giving upper bounds on their size. A constructive policy synthesis procedure derives compact programmatic policies from shortest-path trees, with a polynomial-size bound of $O(|\text{Regions}|^4)$ and a storage-space analysis under discretized endpoints; a Python prototype accompanies the theory. The results support the interpretability, verifiability, and potential generalization of programmatic policies, and outline future directions in verification, learning, and generalization beyond the gridworld setting.

Abstract

The field of reinforcement learning (RL) is concerned with algorithms for learning optimal policies in unknown stochastic environments. Programmatic RL studies representations of policies as programs, meaning involving higher order constructs such as control loops. Despite attracting a lot of attention at the intersection of the machine learning and formal methods communities, very little is known on the theoretical front about programmatic RL: what are good classes of programmatic policies? How large are optimal programmatic policies? How can we learn them? The goal of this paper is to give first answers to these questions, initiating a theoretical study of programmatic RL. Considering a class of gridworld environments, we define a class of programmatic policies. Our main contributions are to place upper bounds on the size of optimal programmatic policies, and to construct an algorithm for synthesizing them. These theoretical findings are complemented by a prototype implementation of the algorithm.

Programmatic Reinforcement Learning: Navigating Gridworlds

TL;DR

This work initiates a theoretical study of programmatic reinforcement learning by modeling policies as executable programs in gridworld environments. It introduces a expressive subgoal DSL and develops a backward tree algorithm that characterizes the set of states from which the target is reachable, proving the existence of optimal programmatic policies and giving upper bounds on their size. A constructive policy synthesis procedure derives compact programmatic policies from shortest-path trees, with a polynomial-size bound of and a storage-space analysis under discretized endpoints; a Python prototype accompanies the theory. The results support the interpretability, verifiability, and potential generalization of programmatic policies, and outline future directions in verification, learning, and generalization beyond the gridworld setting.

Abstract

The field of reinforcement learning (RL) is concerned with algorithms for learning optimal policies in unknown stochastic environments. Programmatic RL studies representations of policies as programs, meaning involving higher order constructs such as control loops. Despite attracting a lot of attention at the intersection of the machine learning and formal methods communities, very little is known on the theoretical front about programmatic RL: what are good classes of programmatic policies? How large are optimal programmatic policies? How can we learn them? The goal of this paper is to give first answers to these questions, initiating a theoretical study of programmatic RL. Considering a class of gridworld environments, we define a class of programmatic policies. Our main contributions are to place upper bounds on the size of optimal programmatic policies, and to construct an algorithm for synthesizing them. These theoretical findings are complemented by a prototype implementation of the algorithm.
Paper Structure (16 sections, 6 theorems, 14 equations, 9 figures, 1 table, 1 algorithm)

This paper contains 16 sections, 6 theorems, 14 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Lemma 1

If $v_1$ and $v_2$ belong to the same region and there exists a path from $v_1$ to $v_2$ inside this region, then there exists a move between $v_1$ and $v_2$.

Figures (9)

  • Figure 1: Spiral
  • Figure 2: Double pass triangle
  • Figure 3: Images published in Life magazine in 1952 which show the path taken by Claude Shannon's mechanical mouse first while learning to navigate the maze and the direct path taken in its second attempt.
  • Figure 4: Backward construction of the tree of the winning region
  • Figure 5: Any path which visits $[a_{k+1}, b_{k+1}] \to$$[a_i, b_i]\to$$[a_{i+1}, b_{i+1}]\to$$[a_j, b_j] \to$$[a_{j+1}, b_{j+1}]$ is necessarily self-intersecting.
  • ...and 4 more figures

Theorems & Definitions (15)

  • Example 1: Spiral
  • Lemma 1
  • Example 2: Double pass triangle
  • Example 3
  • Example 4
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Theorem 1
  • ...and 5 more