Equivariant Reinforcement Learning under Partial Observability

Hai Nguyen; Andrea Baisero; David Klee; Dian Wang; Robert Platt; Christopher Amato

TL;DR

This work addresses reinforcement learning under partial observability by building rotational symmetries into equivariant neural architectures within a POMDP framework. It extends group-invariant MDP theory to POMDPs, proving that a group-invariant POMDP has an invariant optimal value function and at least one equivariant optimal policy under the symmetry group $G$, and implements this with an equivariant actor and an invariant critic, including an equivariant LSTM. The proposed Equi-RA2C and Equi-RSAC architectures outperform non-equivariant baselines in sample efficiency and final performance on grid-world and robotic manipulation tasks, with zero-shot sim-to-real transfer on a UR5 robot. The results show the practical benefit of embedding symmetry into both representation and recurrence under partial observability. The authors acknowledge sensitivity to imperfect symmetry and suggest robustness to asymmetries as future work.

Abstract

Incorporating inductive biases is a promising approach for tackling challenging robot learning domains with sample-efficient solutions. This paper identifies partially observable domains where symmetries can be a useful inductive bias for efficient learning. Specifically, by encoding equivariance with respect to specific group symmetries into the neural networks, our actor-critic reinforcement learning agents can reuse past solutions in related scenarios. Consequently, our equivariant agents significantly outperform non-equivariant approaches in sample efficiency and final performance, as demonstrated through experiments on a range of robotic tasks in simulation and on real hardware.

Paper Structure (67 sections, 2 theorems, 12 equations, 30 figures, 2 tables)

Introduction
Related Works
Background
Partially Observable Markov Decision Processes
$C_n$ and SO(2) Symmetry Groups
Group Representations
Equivariance, Invariance, and Group-invariant MDPs
Group-Invariant POMDPs
Equivariant Actor-Critic RL for POMDPs
Equivariant Modules
Experiments
Domains
Grid-World Domains
Robot Manipulation Domains
Agents
...and 52 more sections

Key Result

Theorem 1

A group-invariant POMDP has an invariant optimal Q-function $Q^*(g h, g a)=Q^*(h, a)$, an invariant optimal value function $V^*(g h) = V^*(h)$, and at least one equivariant deterministic optimal policy $\pi^*(g h) = g\pi^*(h)$.
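The symmetry conditions in Theorem 1 can be checked numerically on a toy example. The sketch below is not the paper's method (which builds equivariance into the network layers); it uses plain group averaging, and it assumes the group $C_4$, a history made only of stacked square observation frames, and four discrete actions indexed counter-clockwise. The names `act_on_history`, `act_on_action`, and `symmetrize` are hypothetical helpers introduced for illustration.

```python
import numpy as np

N = 4  # order of the cyclic group C_4 (rotations by multiples of 90 degrees)

def act_on_history(k, h):
    """Rotate every frame of a history h with shape (T, H, W) by k * 90 degrees CCW."""
    return np.rot90(h, k=k, axes=(1, 2))

def act_on_action(k, a):
    """Assumed action order (east, north, west, south): a 90-degree CCW turn shifts the index by one."""
    return (a + k) % N

def symmetrize(q):
    """Average an arbitrary q(h) -> R^4 over the group so that Q(g h, g a) = Q(h, a)."""
    def q_inv(h):
        out = np.zeros(N)
        for k in range(N):
            values = q(act_on_history(k, h))
            for a in range(N):
                out[a] += values[act_on_action(k, a)] / N
        return out
    return q_inv

rng = np.random.default_rng(0)
T, H, W = 3, 5, 5  # square frames, so rotated histories keep the same shape
weights = rng.normal(size=(N, T * H * W))

def base_q(h):
    """An arbitrary, non-symmetric linear Q-function used only as a starting point."""
    return weights @ h.reshape(-1)

q_inv = symmetrize(base_q)
h = rng.normal(size=(T, H, W))

for k in range(N):
    q_h = q_inv(h)
    q_gh = q_inv(act_on_history(k, h))
    for a in range(N):
        # Invariance of Q: Q(g h, g a) == Q(h, a)
        assert np.isclose(q_gh[act_on_action(k, a)], q_h[a])
    # Invariance of V: max_a Q(g h, a) == max_a Q(h, a)
    assert np.isclose(q_gh.max(), q_h.max())
    # Equivariance of the greedy policy: pi(g h) == g pi(h)
    assert int(np.argmax(q_gh)) == act_on_action(k, int(np.argmax(q_h)))
```

Because the greedy policy is the argmax of an invariant Q-function, rotating the history simply shifts the Q-values cyclically, which is why the selected action rotates with the input.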

Figures (30)

Figure 1: Drawer-Opening: This POMDP is rotationally symmetric in the sense that an optimal solution to the problem on the left (in blue) can be rotated to obtain an optimal solution to a rotated version of the problem on the right (in red).
Figure 2: Illustration of a pixel-wise rotation (characterized by a fixed representation $\rho_f$) and a channel-wise rotation (characterized by the representation $\rho$). When $g$ is a $\pi/2$ CCW rotation, $\rho_f$ always rotates the pixels, while the effect of $\rho$ depends on its type: (a) a trivial representation ($\rho_t$) acting on a 1-channel feature map, (b) a standard representation ($\rho_s$) acting on a vector field, and (c) a regular representation ($\rho_r$) acting on a 2-channel feature map.
Figure 3: Our equivariant agent follows the commonly used structure of a memory-based actor-critic agent [ni2021recurrent, ha2018world, zintgraf2019varibad, hung2019optimizing] but consists of an equivariant actor and an invariant critic, each constructed from equivariant modules. The actor outputs either learned means and standard deviations (for continuous action spaces) or a categorical distribution over the action space (for discrete action spaces).
Figure 4: Equivariant Feature Extractor and Actor/Critic Outputter modules.
Figure 5: Equivariant LSTM cell.
...and 25 more figures
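
The pixel-wise and channel-wise rotations described in the Figure 2 caption can be sketched with NumPy. This is an illustrative sketch under assumptions, not the paper's code: it fixes the group to $C_4$, stores a feature map as an array of shape `(C, H, W)`, and uses a 4-channel map for the regular representation (one channel per group element), which differs from the channel count drawn in the figure. The helper names `rho_t`, `rho_s`, `rho_r`, and `transform` are hypothetical.

```python
import numpy as np

N = 4  # assumed group: C_4, rotations by k * 90 degrees

def rho_t(k):
    """Trivial representation: channels are left unchanged."""
    return np.eye(1)

def rho_s(k):
    """Standard representation: 2x2 rotation matrix acting on (x, y) vector components."""
    theta = k * np.pi / 2
    return np.round(np.array([[np.cos(theta), -np.sin(theta)],
                              [np.sin(theta),  np.cos(theta)]]))

def rho_r(k):
    """Regular representation of C_4: cyclic permutation sending channel j to channel (j + k) % 4."""
    return np.roll(np.eye(N), k, axis=0)

def transform(f, k, rho):
    """Apply g = k * 90 degrees CCW to a feature map f of shape (C, H, W).

    The pixel-wise part (rho_f) rotates the spatial grid; the channel-wise
    part (rho) then mixes the channels at every pixel.
    """
    rotated = np.rot90(f, k=k, axes=(1, 2))
    return np.einsum("ij,jhw->ihw", rho(k), rotated)

rng = np.random.default_rng(0)
for rho, channels in [(rho_t, 1), (rho_s, 2), (rho_r, N)]:
    f = rng.normal(size=(channels, 6, 6))
    # Applying the 90-degree transform twice equals the 180-degree transform.
    assert np.allclose(transform(transform(f, 1, rho), 1, rho), transform(f, 2, rho))
    # Four quarter turns return the original feature map.
    g4 = f
    for _ in range(N):
        g4 = transform(g4, 1, rho)
    assert np.allclose(g4, f)

# A uniform vector field with components (1, 0) becomes (0, 1) under rho_s.
field = np.zeros((2, 6, 6))
field[0] = 1.0
turned = transform(field, 1, rho_s)
assert np.allclose(turned[0], 0.0) and np.allclose(turned[1], 1.0)
```

The split into a spatial rotation followed by a per-pixel channel mix mirrors the caption: $\rho_f$ moves pixels regardless of the feature type, while the choice of $\rho$ decides what happens to the values stored at each pixel.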

Theorems & Definitions (7)

Definition 1
Definition 2
Theorem 1
proof
Lemma 1
proof : Proof By Induction
proof
