Learning Interpretable Policies in Hindsight-Observable POMDPs through Partially Supervised Reinforcement Learning

Michael Lanier; Ying Xu; Nathan Jacobs; Chongjie Zhang; Yevgeniy Vorobeychik

TL;DR

This work tackles learning interpretable policies in partially observable environments by exploiting training-time access to true state information. It introduces Partially Supervised Reinforcement Learning (PSRL), a framework that jointly trains a state predictor $g_\phi(o)$ and a policy $\tilde{\pi}_{\theta_a}$ using a mix of supervised and reinforcement learning losses, producing interpretable policies that operate on predicted semantic state at execution time as $\pi(o)=\tilde{\pi}_{\theta_a}(g_\phi(o))$. The PSRL-$K$ extension adds a latent input $z=h_\psi(o)$, forming $\pi(o)=\tilde{\pi}_{\theta_a}(g_\phi(o), h_\psi(o))$, enabling a spectrum between fully end-to-end RL and semantically-grounded control. Empirical results on five OpenAI Gym domains show that PSRL-0 often achieves better sample efficiency and interpretability than end-to-end methods, with latent augmentation offering limited extra gains in many cases; PSRL also yields more accurate semantic-state predictions than pure E2E embeddings, highlighting the practical benefits of hybrid supervisory signals in POMDPs.
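
The composition described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the linear maps standing in for $g_\phi$, $h_\psi$, and $\tilde{\pi}_{\theta_a}$, the dimensions, the greedy action choice, and the squared-error supervised loss are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: observation, semantic state, latent (K), discrete actions.
OBS_DIM, STATE_DIM, LATENT_DIM, N_ACTIONS = 16, 4, 2, 3

# Hypothetical linear stand-ins for the learned networks.
W_g = rng.normal(size=(STATE_DIM, OBS_DIM))                    # state predictor g_phi
W_h = rng.normal(size=(LATENT_DIM, OBS_DIM))                   # latent encoder h_psi
W_pi0 = rng.normal(size=(N_ACTIONS, STATE_DIM))                # policy head for PSRL-0
W_piK = rng.normal(size=(N_ACTIONS, STATE_DIM + LATENT_DIM))   # policy head for PSRL-K


def g_phi(o):
    """Predict the semantic state from an observation."""
    return W_g @ o


def h_psi(o):
    """Compute an unsupervised latent vector z from an observation."""
    return np.tanh(W_h @ o)


def psrl_0(o):
    """pi(o) = pi_tilde(g_phi(o)): the policy sees only the predicted state."""
    s_hat = g_phi(o)
    action = int(np.argmax(W_pi0 @ s_hat))
    return action, s_hat


def psrl_k(o):
    """pi(o) = pi_tilde(g_phi(o), h_psi(o)): predicted state plus latent input."""
    s_hat, z = g_phi(o), h_psi(o)
    action = int(np.argmax(W_piK @ np.concatenate([s_hat, z])))
    return action, s_hat, z


def supervised_loss(o, s_true):
    """Assumed squared-error loss against the true state available in training."""
    return float(np.mean((g_phi(o) - s_true) ** 2))


o = rng.normal(size=OBS_DIM)
a0, s_hat0 = psrl_0(o)
aK, s_hatK, z = psrl_k(o)
```

Because the policy head in `psrl_0` consumes only `s_hat`, the intermediate prediction can be inspected directly at execution time, which is the sense in which the composed policy is interpretable.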

Abstract

Deep reinforcement learning has demonstrated remarkable achievements across diverse domains such as video games, robotic control, autonomous driving, and drug discovery. Common methodologies in partially-observable domains largely lean on end-to-end learning from high-dimensional observations, such as images, without explicitly reasoning about true state. We suggest an alternative direction, introducing the Partially Supervised Reinforcement Learning (PSRL) framework. At the heart of PSRL is the fusion of both supervised and unsupervised learning. The approach leverages a state estimator to distill supervised semantic state information, which is often fully observable at training time, from high-dimensional observations. This yields more interpretable policies that compose state predictions with control. In parallel, it captures an unsupervised latent representation. These two representations, the semantic state and the latent state, are then fused and used as inputs to a policy network. This combination offers practitioners a flexible spectrum: from emphasizing supervised state information to integrating richer latent information. Extensive experimental results indicate that by merging these two representations, PSRL enhances model interpretability while preserving, and often significantly exceeding, the performance of traditional methods in terms of reward and convergence speed.
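
The fusion of supervised and reinforcement learning signals can be summarized with a combined objective. The form below is an illustrative assumption and not a loss stated in this text: the weight $\lambda$, the per-sample supervised loss $\ell$, and the specific RL loss $\mathcal{L}_{\mathrm{RL}}$ are hypothetical placeholders, with $s$ denoting the true state available at training time.

```latex
\mathcal{L}(\theta_a, \phi, \psi)
  = \mathcal{L}_{\mathrm{RL}}(\theta_a, \phi, \psi)
  + \lambda \, \mathbb{E}_{(o, s)}\!\left[ \ell\big(g_\phi(o),\, s\big) \right]
```

Under this reading, only $g_\phi$ receives the supervised signal, while the latent encoder $h_\psi$ is shaped purely by the RL term.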

Paper Structure (7 sections, 4 equations, 5 figures, 1 table)

Introduction
Preliminaries
Approach
Experiments
Experiment Setup
Results
Conclusion

Figures (5)

Figure 1: The PSRL-$K$ framework.
Figure 2: Experiments in Acrobot (top row), Cart Pole (with finite action sets; middle row), and Mountain Car (bottom row), in finite-action environments using DDQN approaches.
Figure 3: Experiments in Pendulum (top row), Cart Pole (with continuous action sets; middle row), and Reacher (bottom row), in continuous-action environments using PPO approaches.
Figure 4: Experiments comparing PSRL-0 (joint RL and supervised learning) with representation-first and policy-first approaches. Top row, left-to-right: Acrobot 50% pretrained, Acrobot 25% pretrained, Acrobot 10% pretrained. Middle row: Cart Pole 50% pretrained, Cart Pole 25% pretrained, Cart Pole 10% pretrained. Bottom row: Mountain Car 50% pretrained, Mountain Car 25% pretrained, Mountain Car 10% pretrained.
Figure 5: Experiments comparing PSRL-0 (joint RL and supervised learning) with representation-first and policy-first approaches. Top row, left-to-right: Pendulum 50% pretrained, Pendulum 25% pretrained, Pendulum 10% pretrained. Middle row: Continuous Cart Pole 50% pretrained, Continuous Cart Pole 25% pretrained, Continuous Cart Pole 10% pretrained. Bottom row: Reacher 50% pretrained, Reacher 25% pretrained, Reacher 10% pretrained.
