Policy Gradient Methods in the Presence of Symmetries and State Abstractions

Prakash Panangaden; Sahand Rezaei-Shoshtari; Rosie Zhao; David Meger; Doina Precup

TL;DR

The definition of Markov decision process (MDP) homomorphisms is extended to the setting of continuous state and action spaces, and a policy gradient theorem is derived on the abstract MDP for both stochastic and deterministic policies.

Abstract

Reinforcement learning (RL) on high-dimensional and complex problems relies on abstraction for improved efficiency and generalization. In this paper, we study abstraction in the continuous-control setting, and extend the definition of Markov decision process (MDP) homomorphisms to the setting of continuous state and action spaces. We derive a policy gradient theorem on the abstract MDP for both stochastic and deterministic policies. Our policy gradient results allow for leveraging approximate symmetries of the environment for policy optimization. Based on these theorems, we propose a family of actor-critic algorithms that are able to learn the policy and the MDP homomorphism map simultaneously, using the lax bisimulation metric. Finally, we introduce a series of environments with continuous symmetries to further demonstrate the ability of our algorithm for action abstraction in the presence of such symmetries. We demonstrate the effectiveness of our method on our environments, as well as on challenging visual control tasks from the DeepMind Control Suite. Our method's ability to utilize MDP homomorphisms for representation learning leads to improved performance, and the visualizations of the latent space clearly demonstrate the structure of the learned abstraction.

Paper Structure

This paper contains 57 sections, 15 theorems, 55 equations, 23 figures, 1 table, and 1 algorithm.

Introduction
Related Work
State Abstraction.
Action Abstraction.
State Representation Learning.
Equivariant Representation Learning.
Background
Markov Decision Processes
Policy Gradient Theorems
Bisimulation and Bisimulation Metrics
Finite MDP Homomorphisms
Continuous MDP Homomorphisms
Optimal Value Equivalence
Lifting Policies and Value Equivalence
Homomorphic Policy Gradient
...and 42 more sections

Key Result

Theorem 2

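Let $\pi_\theta : \mathcal{S} \to \Delta(\mathcal{A})$ be a stochastic policy defined on $\mathcal{M}$. Then the gradient of the performance measure $J(\theta)$ w.r.t. $\theta$ is:

$$\nabla_\theta J(\theta) = \int_{s \in \mathcal{S}} \rho^{\pi_\theta}(s) \int_{a \in \mathcal{A}} Q^{\pi_\theta}(s, a) \, \nabla_\theta \pi_\theta(a | s) \, da \, ds,$$

where $\rho^{\pi_\theta} (s) = \lim_{t \rightarrow \infty} \gamma^t P(s_t=s | s_0, a_{0:t} \sim \pi_\theta)$ is the discounted stationary distribution of states under $\pi_\theta$ and $Q^{\pi_\theta}$ is the action-value function of $\pi_\theta$.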
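The statement of Theorem 2 can be checked numerically on a small finite MDP, where the integrals over states and actions become sums. The sketch below is an illustration under stated assumptions and is not taken from the paper: it uses a randomly generated tabular MDP, a softmax policy with one logit per state-action pair, the performance measure $J(\theta) = \sum_s \mu(s) V^{\pi_\theta}(s)$ for a uniform initial distribution $\mu$, and the unnormalized discounted visitation $d(s) = \sum_t \gamma^t P(s_t = s)$ in place of $\rho^{\pi_\theta}$. All function and variable names are hypothetical.

```python
import numpy as np

def softmax_policy(theta):
    """Row-wise softmax: pi[s, a] = pi_theta(a | s)."""
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def evaluate(theta, P, R, mu, gamma):
    """Exact J, policy, V, Q, and discounted visitation for a tabular MDP."""
    n_s = R.shape[0]
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)        # state kernel under pi
    r_pi = (pi * R).sum(axis=1)                  # expected reward per state
    V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
    Q = R + gamma * (P @ V)                      # Q[s, a]
    # d solves d = mu + gamma * P_pi^T d (unnormalized visitation).
    d = np.linalg.solve(np.eye(n_s) - gamma * P_pi.T, mu)
    return float(mu @ V), pi, V, Q, d

def policy_gradient(theta, P, R, mu, gamma):
    """Discrete form of the theorem: sum_s d(s) sum_a Q(s,a) grad pi(a|s)."""
    _, pi, V, Q, d = evaluate(theta, P, R, mu, gamma)
    # For softmax logits the inner sum reduces to pi(a|s) * (Q(s,a) - V(s)).
    return d[:, None] * pi * (Q - V[:, None])

def finite_difference_gradient(theta, P, R, mu, gamma, eps=1e-5):
    """Central differences on J, used only as an independent reference."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        j_plus = evaluate(theta + step, P, R, mu, gamma)[0]
        j_minus = evaluate(theta - step, P, R, mu, gamma)[0]
        grad[idx] = (j_plus - j_minus) / (2 * eps)
    return grad

# Hypothetical random MDP with 4 states and 3 actions.
rng = np.random.default_rng(0)
n_s, n_a, gamma = 4, 3, 0.9
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)                # normalize transitions
R = rng.random((n_s, n_a))
mu = np.full(n_s, 1.0 / n_s)
theta = rng.normal(size=(n_s, n_a))

pg = policy_gradient(theta, P, R, mu, gamma)
fd = finite_difference_gradient(theta, P, R, mu, gamma)
print("max abs difference:", np.abs(pg - fd).max())
```

The two gradients should agree up to finite-difference error. The exact linear solves are only feasible because the example is tabular; in continuous control the same quantity is estimated from samples.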

Figures (23)

Figure 1: Overview of an MDP homomorphism $h = (f, g_s)$. (a) Components of an MDP homomorphism map, and the relation between the actual and abstract MDPs. (b) Commutative diagrams for MDP homomorphisms demonstrating the equivariance of transitions and the invariance of rewards. The diagram is adapted from ravindran2001symmetries.
Figure 2: Schematics of HPG. The actual MDP $\mathcal{M}$ is used to train $Q^{\pi^\uparrow}$ and update $\pi^\uparrow$ with the standard PG theorem, while the abstract MDP ${\overline{\mathcal{M}}}$ is used to train $\overline{Q}^{\overline{\pi}}$ and update $\overline{\pi}$ with the homomorphic PG theorem. ${\overline{\mathcal{M}}}$ is the MDP homomorphic image of $\mathcal{M}$ obtained by learning the homomorphism map $h \!=\! ( f, g_s )$. The policies $\pi^\uparrow$ and $\overline{\pi}$ are coupled together through the lifting procedure.
Figure 3: Results on DM Control tasks with pixel observations, obtained over 10 seeds. RLiable metrics are aggregated over 14 tasks. All methods use image augmentation. (a) RLiable IQM scores as a function of the number of steps, comparing sample efficiency; (b) RLiable performance profiles at 500k steps; (c) learning curves on the pendulum swingup task. Full results are in Appendix \ref{sec:additional_results_pixels}. Shaded regions represent $95\%$ confidence intervals.
Figure 4: Effectiveness of DHPG in recovering the minimal MDP from pixels. All methods are limited to a 4-dimensional latent space, which matches the dimensionality of the true state space of cartpole. (a) Trajectories of real states obtained from MuJoCo and trajectories of latent states of DHPG. (b, c) Learning curves averaged over 10 seeds.
Figure 5: Contours of actual and abstract optimal actions over the state space of the pendulum-swingup task. Colors represent action values, and states are $s \!=\! (\theta, \dot{\theta})$. (a) Actual optimal policy; contours of optimal actions $a^* \!=\! \pi^{\uparrow^*}\!(s)$. (b) Abstract optimal policy; contours of abstract optimal actions $\overline{a}^* \!=\! g_s(a^*) \!=\! \overline{\pi}^*(\overline{s})$. The relation $g_{s_1}\!(a_1) \!=\! g_{s_2}\!(a_2)$ holds for equivalent state-action pairs, and the abstract optimal policy is symmetric.
...and 18 more figures
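The two properties named in the caption of Figure 1, invariance of rewards and equivariance of transitions, can be made concrete on a finite toy problem. The sketch below is not from the paper: it assumes the standard finite-MDP conditions $\overline{R}(f(s), g_s(a)) = R(s, a)$ and $\overline{P}(\overline{s}' \mid f(s), g_s(a)) = \sum_{s' : f(s') = \overline{s}'} P(s' \mid s, a)$, and a hypothetical five-state line world with a reflection symmetry about the origin. The continuous definition studied in the paper is more general; all names here are illustrative.

```python
import numpy as np

def is_homomorphism(P, R, P_bar, R_bar, f, g, tol=1e-9):
    """Check reward invariance and transition equivariance for h = (f, g_s).

    f[s] is the abstract state index, g[s, a] the abstract action index.
    """
    n_abs = P_bar.shape[0]
    F = np.eye(n_abs)[f]                  # one-hot state map, (n_s, n_abs)
    P_agg = P @ F                         # mass landing in each abstract state
    reward_ok = np.allclose(R, R_bar[f[:, None], g], atol=tol)
    trans_ok = np.allclose(P_agg, P_bar[f[:, None], g], atol=tol)
    return bool(reward_ok and trans_ok)

# Actual MDP: positions -2..2, actions move left (-1) or right (+1),
# deterministic clipped dynamics, reward -|position|.
positions = np.arange(-2, 3)
moves = np.array([-1, 1])
P = np.zeros((5, 2, 5))
R = np.zeros((5, 2))
for i, pos in enumerate(positions):
    for j, move in enumerate(moves):
        nxt = int(np.clip(pos + move, -2, 2))
        P[i, j, nxt + 2] = 1.0
        R[i, j] = -abs(pos)

# Abstract MDP: distance to the origin 0..2, actions toward (-1) or away (+1).
P_bar = np.zeros((3, 2, 3))
R_bar = np.zeros((3, 2))
for x in range(3):
    for j, move in enumerate(moves):
        nxt = int(np.clip(x + move, 0, 2))
        P_bar[x, j, nxt] = 1.0
        R_bar[x, j] = -x

# State map f(s) = |s|; state-dependent action map g_s(a) = a * sign(s),
# with both actions at the origin mapped to "away".
f = np.abs(positions)
g = np.zeros((5, 2), dtype=int)
for i, pos in enumerate(positions):
    for j, move in enumerate(moves):
        abstract_move = 1 if pos == 0 else int(move * np.sign(pos))
        g[i, j] = (abstract_move + 1) // 2    # -1 -> index 0, +1 -> index 1

holds = is_homomorphism(P, R, P_bar, R_bar, f, g)

# Breaking the symmetry of the reward at a single state-action pair
# violates reward invariance.
R_broken = R.copy()
R_broken[0, 0] = 5.0
broken = is_homomorphism(P, R_broken, P_bar, R_bar, f, g)
print(holds, broken)
```

Here the action map depends on the state, since "right" means "away from the origin" for positive positions and "toward the origin" for negative ones, which is why the homomorphism is written as $h = (f, g_s)$ rather than with a single action map.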

Theorems & Definitions (33)

Definition 1: MDP
Theorem 2: sutton2000policy
Theorem 3: silver2014deterministic
Definition 4: Bisimulation
Definition 5: MDP Homomorphism
Theorem 6: ravindran2001symmetries
Definition 7: Lifted Policy
Theorem 8: ravindran2001symmetries
Theorem 9: rezaei2022continuous
Definition 10: Continuous MDP
...and 23 more
