Boosting deep Reinforcement Learning using pretraining with Logical Options

Zihan Ye, Phil Chau, Raban Emunds, Jannis Blüml, Cedric Derstroff, Quentin Delfosse, Oleg Arenz, Kristian Kersting

TL;DR

This work uses a two-stage framework that injects symbolic structure into neural-based reinforcement learning agents without sacrificing the expressivity of deep policies, and introduces a logical option-based pretraining strategy to steer the learning policy away from short-term reward loops and toward goal-directed behavior.

Abstract

Deep reinforcement learning agents are often misaligned, as they over-exploit early reward signals. Recently, several symbolic approaches have addressed these challenges by encoding sparse objectives along with aligned plans. However, purely symbolic architectures are complex to scale and difficult to apply to continuous settings. Hence, we propose a hybrid approach, inspired by humans' ability to acquire new skills. We use a two-stage framework that injects symbolic structure into neural-based reinforcement learning agents without sacrificing the expressivity of deep policies. Our method, called Hybrid Hierarchical RL (H$^2$RL), introduces a logical option-based pretraining strategy to steer the learning policy away from short-term reward loops and toward goal-directed behavior while allowing the final policy to be refined via standard environment interaction. Empirically, we show that this approach consistently improves long-horizon decision-making and yields agents that outperform strong neural, symbolic, and neuro-symbolic baselines.

Paper Structure (25 sections, 13 equations, 4 figures, 16 tables)

Figures (4)

  • Figure 1: Deep reinforcement learning policies are often misaligned, as exemplified by neural PPO agents. Although the oxygen is running low in Seaquest (left) and the goal in Kangaroo (right) is to go up, the PPO agents fail to choose the optimal actions (in green). Instead, they focus on immediate rewards, e.g., by continuing to attack enemies.
  • Figure 2: Overview of the framework. Through logic-informed pretraining, H$^2$RL embeds logic priors directly into neural policies, thereby addressing the deep policy misalignment issue. H$^2$RL provides a two-stage training paradigm. In the first stage, the deep policy is jointly trained with the logic manager and the gating module (referred to as deep policy pretraining). In the second stage, the deep policy is further trained through direct interaction with the environment (referred to as deep policy post-training). See Sec. \ref{HHRL} for details.
  • Figure 3: Leveraging logic-informed pretraining, H$^2$RL and its variants (in bold) outperform baselines on challenging ALE tasks (Seaquest, Kangaroo, and DonkeyKong) with long-horizon dependencies and reward traps. Although DQN and PPO achieve high returns in Kangaroo, their learned policies remain misaligned; see Sec. \ref{experiments} (RQ1 and RQ3) for details. Episodic returns are averaged over 12 environments (with 200 runs per environment). Results are presented on a log scale to normalize for the disparate reward magnitudes across games.
  • Figure 4: H$^2$RL effectively leverages logic reasoning in continuous action spaces and improves deep agents' performance. We compare H$^2$RL with methods applicable to continuous action spaces on the Kangaroo and DonkeyKong tasks in CALE, where H$^2$RL consistently outperforms these baselines. See Sec. \ref{experiments} (RQ4) for details.
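
The two-stage paradigm described in the Figure 2 caption can be illustrated with a toy sketch. This is not the authors' implementation: the convex blending rule, the scalar `gate` weight, and all function names below are hypothetical assumptions, chosen only to show how a logic prior might steer a neural policy during pretraining and then be dropped during post-training.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_policy(neural_logits, logic_probs, gate):
    """Stage 1 (pretraining), ASSUMED form: convex blend of two distributions.

    `gate` in [0, 1] is a hypothetical weight on the logic prior; the paper's
    gating module may combine the two policies differently.
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    neural_probs = softmax(neural_logits)
    return [gate * l + (1.0 - gate) * n
            for l, n in zip(logic_probs, neural_probs)]


def post_training_policy(neural_logits):
    """Stage 2 (post-training): the deep policy acts on its own."""
    return softmax(neural_logits)


# Toy example with three actions: [attack, move_up, idle].
neural_logits = [2.0, 0.0, 0.0]  # neural policy favours the short-term reward
logic_probs = [0.0, 1.0, 0.0]    # logic prior proposes the goal-directed action

stage1 = mixed_policy(neural_logits, logic_probs, gate=0.8)
stage2 = post_training_policy(neural_logits)
```

In this toy setting the blended stage-1 distribution puts most of its mass on `move_up`, while the untrained neural policy alone still prefers `attack`. The intent of pretraining, as the abstract describes it, is for the deep policy to absorb the goal-directed behavior so that it persists once the logic components are no longer consulted.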