Automaton Distillation: Neuro-Symbolic Transfer Learning for Deep Reinforcement Learning

Suraj Singireddy, Precious Nwaorgu, Andre Beckus, Aden McKinney, Chinwendu Enyioha, Sumit Kumar Jha, George K. Atia, Alvaro Velasquez

TL;DR

Automaton distillation decreases the time required to find optimal policies for various decision tasks in new environments, even when the target environment differs in structure from the source environment.

Abstract

Reinforcement learning (RL) is a powerful tool for finding optimal policies in sequential decision processes. However, deep RL methods have two weaknesses: collecting the amount of agent experience required for practical RL problems is prohibitively expensive, and the learned policies generalize poorly to tasks outside the training data distribution. To mitigate these issues, we introduce automaton distillation, a form of neuro-symbolic transfer learning in which Q-value estimates from a teacher are distilled into a low-dimensional representation in the form of an automaton. We then propose methods for generating Q-value estimates in which symbolic information is extracted from a teacher's Deep Q-Network (DQN). The resulting Q-value estimates are used to bootstrap learning in discrete and continuous target environments via modified DQN and Twin Delayed Deep Deterministic Policy Gradient (TD3) loss functions, respectively. We demonstrate that automaton distillation decreases the time required to find optimal policies for various decision tasks in new environments, even when the target environment differs in structure from the source environment.
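
The distillation step described in the abstract can be illustrated with a minimal sketch. It assumes that teacher Q-values are averaged per automaton transition and that the student mixes the distilled estimate into its usual temporal-difference target with a weight `beta`. Both the averaging rule and the convex blend are assumptions made for illustration, since the abstract does not give the exact loss. The names `distill_q_values`, `bootstrapped_target`, and `beta` are hypothetical.

```python
from collections import defaultdict


def distill_q_values(samples):
    """Average teacher Q-values per automaton transition.

    samples: iterable of (automaton_state, symbol, teacher_q) tuples, where
    symbol is the set of atomic propositions observed on that step and
    teacher_q is the teacher network's Q-value for the matching environment
    transition. The averaging rule is an assumption, not the paper's formula.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for q_state, symbol, teacher_q in samples:
        key = (q_state, frozenset(symbol))
        totals[key] += teacher_q
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


def bootstrapped_target(td_target, automaton_q, beta):
    """Blend the student's TD target with the distilled estimate.

    beta = 1 trusts the automaton estimate fully; beta = 0 recovers the
    ordinary TD target. The convex-combination form is an assumption.
    """
    return beta * automaton_q + (1.0 - beta) * td_target


# Hypothetical teacher experience: two visits to one transition, one to another.
teacher_samples = [
    ("q0", {"sword"}, 1.0),
    ("q0", {"sword"}, 3.0),
    ("q0", set(), 0.5),
]
automaton_q = distill_q_values(teacher_samples)
```

Keying the table on automaton transitions rather than environment states is what makes the representation low-dimensional: a student in a differently structured environment can look up the same keys as long as it shares the automaton.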

Paper Structure (7 sections, 11 equations, 7 figures, 1 algorithm)

Figures (7)

  • Figure 1: Example environment configurations for the Blind Craftsman teacher (a) and student (b) environments with additional obstacles introduced. (c) A continuous state and action student environment where the yellow, green, and orange cubes represent the agent, wood, factory, and home, respectively, placed at random positions in the continuous space.
  • Figure 2: (a) A simple NMRDP. At each time step, the agent may move one square in any cardinal direction. A sequence of actions satisfies the objective if and only if the agent obtains both the sword and the shield. The objective is decomposed using the atomic propositions $AP = \{\text{sword}, \text{shield}\}$, with a labeling function $L$ such that $L(s_0) = \{\}, L(s_1) = \{\}, L(s_2) = \{\text{sword}\}, L(s_3) = \{\text{shield}\}$. Rollouts which achieve the objective also satisfy the LTL$_f$ specification $\phi = \textbf{F}(\text{sword}) \wedge \textbf{F}(\text{shield})$. (b) An automaton defined over the alphabet $\Sigma = \{\{\}, \{\text{sword}\}, \{\text{shield}\}, \{\text{sword, shield}\}\}$. The automaton accepts the subset of strings in $\Sigma^*$ that satisfy the LTL$_f$ formula.
  • Figure 3: Example $7 \times 7$ environment configurations for the Blind Craftsman (a), Dungeon Quest (b), and Diamond Mine (c) environments.
  • Figure 4: Simple automaton with two traces.
  • Figure 5: Reward per episode (y-axis) over time (x-axis) during training using dynamic automaton distillation (blue) vs. static automaton distillation (orange), Counterfactual Reward Machine (CRM) (green), Q-learning over the product MDP (red), and vanilla Q-learning (purple) on the Dungeon Quest (a), Diamond Mine (b) and Blind Craftsman (c) environments. Each line is an ensemble average over 20 trials. Blind Craftsman with Obstacles (see Figure 1) is shown in (d). Dynamic automaton distillation for discrete to continuous Blind Craftsman, Diamond Mine, and Dungeon Quest are captured in (e), (f), and (g), respectively.
  • ...and 2 more figures
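
The automaton in Figure 2(b) can be sketched in a few lines of Python. This is a minimal illustration, assuming a four-state DFA that remembers which of the two items has been seen so far. The state names (`q0`, `q_sword`, `q_shield`, `q_acc`) and the helper names `step`, `accepts`, and `label_trace` are hypothetical; only the atomic propositions, the labeling function $L$, and the formula $\phi = \textbf{F}(\text{sword}) \wedge \textbf{F}(\text{shield})$ come from the caption.

```python
# Hypothetical state names; the caption does not name the automaton states.
START, HAS_SWORD, HAS_SHIELD, ACCEPT = "q0", "q_sword", "q_shield", "q_acc"

# Labeling function L from Figure 2(a): environment state -> atomic propositions.
LABELS = {
    "s0": frozenset(),
    "s1": frozenset(),
    "s2": frozenset({"sword"}),
    "s3": frozenset({"shield"}),
}


def step(state, symbol):
    """Advance the DFA on one symbol (a set of atomic propositions)."""
    have_sword = state in (HAS_SWORD, ACCEPT) or "sword" in symbol
    have_shield = state in (HAS_SHIELD, ACCEPT) or "shield" in symbol
    if have_sword and have_shield:
        return ACCEPT
    if have_sword:
        return HAS_SWORD
    if have_shield:
        return HAS_SHIELD
    return START


def accepts(trace):
    """Return True if the sequence of symbols satisfies F(sword) & F(shield)."""
    state = START
    for symbol in trace:
        state = step(state, frozenset(symbol))
    return state == ACCEPT


def label_trace(env_states):
    """Map a rollout of environment states to a string over the alphabet."""
    return [LABELS[s] for s in env_states]
```

For example, `accepts(label_trace(["s0", "s2", "s1", "s3"]))` holds because the rollout visits both the sword and the shield, while a rollout that stops after `s2` is rejected.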

Theorems & Definitions (5)

  • Definition 1: Non-Markovian Reward Decision Process (NMRDP)
  • Definition 2: Deterministic Finite-State Automaton (DFA)
  • Remark 1
  • Definition 3: Cross-Product Markov Decision Process
  • Definition 4: Finite-Trace Linear Temporal Logic (LTL$_f$)