Program Machine Policy: Addressing Long-Horizon Tasks by Integrating Program Synthesis and State Machines

Yu-An Lin, Chen-Tao Lee, Guan-Ting Liu, Pu-Jen Cheng, Shao-Hua Sun

TL;DR

Program Machine Policy (POMP) tackles the dual challenges of interpretability and long-horizon planning in reinforcement learning by representing policies as a state machine whose internal modes are human-readable programs in a Karel domain-specific language. It builds a smooth program embedding space, retrieves a diverse and compatible set of mode programs via an enhanced Cross-Entropy Method (CEM) with a diversity multiplier and compatibility checks, and trains a transition function with PPO to switch between modes to maximize discounted return in finite-horizon MDPs. On a suite of Karel tasks, including long-horizon benchmarks, POMP outperforms both deep RL and prior programmatic RL baselines, and demonstrates inductive generalization to horizons beyond training without fine-tuning. Ablation studies validate the necessity of diversity and compatibility in mode retrieval and show that the interpretable state-machine structure yields robust, repetitive-subroutine execution, with the option to extract an explicit FSM for improved explainability.

Abstract

Deep reinforcement learning (deep RL) excels in various domains but lacks generalizability and interpretability. Programmatic RL methods (Trivedi et al., 2021; Liu et al., 2023) instead reformulate RL tasks as synthesizing interpretable programs that can be executed in the environments. Despite encouraging results, these methods are limited to short-horizon tasks. On the other hand, representing RL policies using state machines (Inala et al., 2020) can inductively generalize to long-horizon tasks; however, this approach struggles to scale up to acquire diverse and complex behaviors. This work proposes the Program Machine Policy (POMP), which bridges the advantages of programmatic RL and state machine policies, allowing it to represent complex behaviors and address long-horizon tasks. Specifically, we introduce a method that can retrieve a set of effective, diverse, and compatible programs. We then use these programs as modes of a state machine and learn a transition function to transition among mode programs, which allows the policy to capture repetitive behaviors. Our proposed framework outperforms programmatic RL and deep RL baselines on various tasks and demonstrates the ability to inductively generalize to even longer horizons without any fine-tuning. Ablation studies justify the effectiveness of our proposed search algorithm for retrieving a set of programs as modes.
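
The idea of using programs as modes of a state machine can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Corridor` toy environment, the two mode programs `harvest_two` and `return_to_start`, and the hand-written `transition` function are hypothetical stand-ins. In the paper the modes are Karel DSL programs and the transition function is learned, so this only shows the control flow of executing one mode program to completion and then choosing the next mode.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Corridor:
    """Toy 1-D environment (hypothetical): one marker count per cell."""
    markers: List[int]
    pos: int = 0


# A mode program runs to completion and returns the reward it collected.
ModeProgram = Callable[[Corridor], float]


def harvest_two(env: Corridor) -> float:
    """Pick markers on up to two cells while moving right."""
    reward = 0.0
    for _ in range(2):
        if env.pos >= len(env.markers):
            break
        reward += env.markers[env.pos]
        env.markers[env.pos] = 0
        env.pos += 1
    return reward


def return_to_start(env: Corridor) -> float:
    """Walk back to the first cell; collects no reward."""
    env.pos = 0
    return 0.0


def transition(env: Corridor, current_mode: Optional[int]) -> List[float]:
    """Hand-written stand-in for the learned transition function.

    Returns probabilities over [mode 0, mode 1, terminate]. A learned
    function would also condition on current_mode; this toy rule does not.
    """
    if sum(env.markers[env.pos:]) > 0:
        return [1.0, 0.0, 0.0]  # markers ahead: keep harvesting
    if env.pos != 0:
        return [0.0, 1.0, 0.0]  # nothing ahead: go back
    return [0.0, 0.0, 1.0]      # back at start, nothing left: terminate


def run_program_machine(
    modes: List[ModeProgram],
    transition_fn: Callable[[Corridor, Optional[int]], List[float]],
    env: Corridor,
    max_mode_steps: int,
    rng: random.Random,
) -> Tuple[float, List[int]]:
    """Repeatedly sample the next mode and execute its program."""
    total, trace = 0.0, []
    current: Optional[int] = None
    for _ in range(max_mode_steps):
        probs = transition_fn(env, current)
        nxt = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if nxt == len(modes):  # the last index is the termination mode
            break
        total += modes[nxt](env)
        trace.append(nxt)
        current = nxt
    return total, trace


env = Corridor(markers=[1, 1, 1, 1, 1])
total, trace = run_program_machine(
    [harvest_two, return_to_start], transition, env, 10, random.Random(0)
)
print(total, trace)
```

The same short program (`harvest_two`) is reused several times in a row, which is the kind of repetitive behavior the abstract attributes to the state-machine structure: the horizon can grow without the individual programs getting longer.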

Paper Structure (62 sections, 3 equations, 24 figures, 2 tables, 1 algorithm)

Figures (24)

  • Figure 1: Karel Domain-Specific Language (DSL), designed for describing the Karel agent's behaviors.
  • Figure 2: Learning Program Machine Policy. (a): Retrieving mode programs. After learning the program embedding space, we propose an advanced search scheme built upon the Cross-Entropy Method (CEM) to search programs $\rho_{m_1}, ..., \rho_{m_k}, \rho_{m_{k+1}}$ of different skills. While searching for the next mode program $\rho_{m_{k+1}}$, we consider its compatibility with previously determined mode programs $\rho_{m_1}, ..., \rho_{m_k}$ by randomly sampling a sequence of mode programs. We also consider the diversity among all mode programs using the diversity multiplier. (b): Learning the mode transition function. Given the current environment state $s$ and the current mode $m_\text{current}$, the mode transition function predicts the transition probability over each mode of the state machine with the aim of maximizing the total cumulative reward from the environment.
  • Figure 3: Karel-Long Problem Set: This work introduces a new set of tasks in the Karel domain. These tasks necessitate learning diverse, repetitive, and task-specific skills. For example, in our designed Inf-Harvester, the agent needs to traverse the whole map and pick nearly 400 markers to solve the task since the environment randomly generates markers; in contrast, the Harvester from the Karel problem set (Trivedi et al., 2021) can be solved by picking just 36 markers.
  • Figure 4: (a) Program sample efficiency. The training curves of POMP and other programmatic RL approaches, where the x-axis is the total number of executed programs for interacting with the environment, and the y-axis is the maximum validation return. This demonstrates that our proposed framework has better program sample efficiency and converges to better performance. (b) Inductive generalization performance. We evaluate and report the performance drop in the testing environments with an extended horizon, where the x-axis is the extended horizon length compared to the horizon of the training environments, and the y-axis is the performance drop in percentage. Our proposed framework can inductively generalize to longer horizons without any fine-tuning.
  • Figure 5: Using the Cross-Entropy Method to search for a program with high execution reward in the learned program embedding space.
  • ...and 19 more figures
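
The CEM-based search described in the captions of Figures 2 and 5 can be sketched roughly as follows. This is a toy under stated assumptions, not the paper's implementation: `toy_score` stands in for decoding a latent vector into a program and executing it, scores are assumed non-negative, and the diversity multiplier is given one plausible form (a sigmoid of the negated maximum cosine similarity to previously retrieved embeddings) chosen here for illustration. The compatibility check is omitted.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


def diversity_multiplier(z: Vector, found: List[Vector]) -> float:
    """Assumed form: sigmoid(-max cosine similarity); 1.0 if nothing found yet."""
    if not found:
        return 1.0
    max_sim = max(cosine_similarity(z, f) for f in found)
    return 1.0 / (1.0 + math.exp(max_sim))


def cem_search(
    score: Callable[[Vector], float],
    dim: int,
    found: List[Vector],
    rng: random.Random,
    iterations: int = 30,
    population: int = 64,
    elite_frac: float = 0.125,
    init_std: float = 1.0,
) -> Vector:
    """Cross-Entropy Method over a latent space with a diagonal Gaussian."""
    mean = [0.0] * dim
    std = [init_std] * dim
    n_elite = max(1, int(population * elite_frac))
    for _ in range(iterations):
        candidates = [
            [rng.gauss(m, s) for m, s in zip(mean, std)]
            for _ in range(population)
        ]
        # Rank by score scaled down for candidates similar to earlier modes.
        ranked = sorted(
            candidates,
            key=lambda z: score(z) * diversity_multiplier(z, found),
            reverse=True,
        )
        elites = ranked[:n_elite]
        # Refit the sampling distribution to the elites.
        mean = [sum(e[i] for e in elites) / n_elite for i in range(dim)]
        std = [
            math.sqrt(sum((e[i] - mean[i]) ** 2 for e in elites) / n_elite) + 1e-3
            for i in range(dim)
        ]
    return mean


def toy_score(z: Vector) -> float:
    """Hypothetical reward landscape with two equally good peaks."""
    peaks = [[2.0, 0.0], [-2.0, 0.0]]
    return sum(math.exp(-sum((a - b) ** 2 for a, b in zip(z, p))) for p in peaks)


rng = random.Random(0)
modes: List[Vector] = []
for _ in range(2):
    modes.append(cem_search(toy_score, dim=2, found=modes, rng=rng))
print(modes)
```

In this toy landscape the multiplier penalizes candidates pointing in the same direction as an already retrieved embedding, so the second search tends to settle near the other peak. Cosine similarity ignores vector magnitude, which is one reason the form used here should be read as an illustrative choice only.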