Reclaiming the Source of Programmatic Policies: Programmatic versus Latent Spaces

Tales H. Carvalho, Kenneth Tjhia, Levi H. S. Lelis

TL;DR

The programmatic space, which is induced by the domain-specific language and requires no training, yields behavior-loss values similar to those of the learned latent spaces from previous work, and algorithms searching in the programmatic space significantly outperform those searching in the latent spaces of LEAPS and HPRL.

Abstract

Recent works have introduced LEAPS and HPRL, systems that learn latent spaces of domain-specific languages, which are used to define programmatic policies for partially observable Markov decision processes (POMDPs). These systems induce a latent space while optimizing losses such as the behavior loss, which aim to achieve locality in program behavior, meaning that vectors close in the latent space should correspond to similarly behaving programs. In this paper, we show that the programmatic space, induced by the domain-specific language and requiring no training, presents values for the behavior loss similar to those observed in latent spaces presented in previous work. Moreover, algorithms searching in the programmatic space significantly outperform those in LEAPS and HPRL. To explain our results, we measured the "friendliness" of the two spaces to local search algorithms. We discovered that algorithms are more likely to stop at local maxima when searching in the latent space than when searching in the programmatic space. This implies that the optimization topology of the programmatic space, induced by the reward function in conjunction with the neighborhood function, is more conducive to search than that of the latent space. This result provides an explanation for the superior performance in the programmatic space.
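The notion of a local search stopping at a local maximum, which depends on both the reward function and the neighborhood function, can be illustrated with a minimal sketch. This is not the paper's implementation: the candidate space (integers), `toy_reward`, and `make_step_neighbors` are hypothetical stand-ins chosen only to show how the same reward function can be easier or harder to climb depending on the neighborhood.

```python
import random
from typing import Callable, List, Tuple

Neighbors = Callable[[int, int, random.Random], List[int]]


def hill_climb(
    start: int,
    reward: Callable[[int], float],
    neighbors: Neighbors,
    k: int,
    rng: random.Random,
    max_steps: int = 1000,
) -> Tuple[int, float]:
    """Move to the best of up to k sampled neighbors; stop when none improves."""
    current, current_reward = start, reward(start)
    for _ in range(max_steps):
        candidates = neighbors(current, k, rng)
        if not candidates:
            break
        best = max(candidates, key=reward)
        best_reward = reward(best)
        if best_reward <= current_reward:
            break  # local maximum with respect to the sampled neighborhood
        current, current_reward = best, best_reward
    return current, current_reward


def toy_reward(x: int) -> float:
    """Hypothetical reward: local maximum at x = 5 (0.5), global maximum at x = 15 (1.0)."""
    return max(0.5 - 0.1 * abs(x - 5), 1.0 - 0.1 * abs(x - 15))


def make_step_neighbors(radius: int) -> Neighbors:
    """Hypothetical neighborhood function: integers within `radius` of the candidate."""
    def neighbors(x: int, k: int, rng: random.Random) -> List[int]:
        pool = [x + d for d in range(-radius, radius + 1) if d != 0]
        return rng.sample(pool, min(k, len(pool)))
    return neighbors


narrow = hill_climb(3, toy_reward, make_step_neighbors(1), k=250, rng=random.Random(0))
wide = hill_climb(3, toy_reward, make_step_neighbors(6), k=250, rng=random.Random(0))
```

Starting from the same candidate, the search with the narrow neighborhood stops at the local maximum x = 5, while the search with the wider neighborhood crosses the valley and reaches x = 15. The reward function is identical in both runs; only the neighborhood function changes the topology the search sees.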

Paper Structure

This paper contains 41 sections, 5 equations, 12 figures, 5 tables, 2 algorithms.

Figures (12)

  • Figure 1: DSL for Karel the Robot as a context-free grammar.
  • Figure 2: An example of a program defined in Karel DSL (left) and its AST representation (right). In the AST, MP stands for markersPresent, PM for pickMarker, and M for move.
  • Figure 3: Episodic return performance of all methods in Karel and Karel-Hard problem sets. Reported mean and $95\%$ confidence interval over $32$ seeds. The x-axis is represented in log scale.
  • Figure 4: Behavior-similarity and identity-rate metrics on Programmatic Space and Latent Space ($\sigma=\{0.1,0.25,0.5\}$). Reported mean and $95\%$ confidence interval of the estimation of each metric over a set of $32$ initial states of the environment and $1{,}000$ seeds for the initial programs.
  • Figure 5: Convergence rate of the hill-climbing algorithm in the Programmatic Space and in the Latent Space with neighborhood size $K=250$. Reported mean and $95\%$ confidence interval of estimation over a set of $10{,}000$ initial candidates. The plots for DoorKey and Snake show a zoomed-in region highlighting runs of the search that achieve reward values larger than $0.5$.
  • ...and 7 more figures