Learning Successor Features the Simple Way

Raymond Chua, Arna Ghosh, Christos Kaplanis, Blake A. Richards, Doina Precup

TL;DR

This work provides a new, streamlined technique for learning successor features (SFs) directly from pixel observations, with no pretraining required, and shows that this approach matches or outperforms existing SF learning techniques in 2D and 3D mazes and in Mujoco, in both single-task and continual learning scenarios.

Abstract

In Deep Reinforcement Learning (RL), it is a challenge to learn representations that do not exhibit catastrophic forgetting or interference in non-stationary environments. Successor Features (SFs) offer a potential solution to this challenge. However, canonical techniques for learning SFs from pixel-level observations often lead to representation collapse, wherein representations degenerate and fail to capture meaningful variations in the data. More recent methods for learning SFs can avoid representation collapse, but they often involve complex losses and multiple learning phases, reducing their efficiency. We introduce a novel, simple method for learning SFs directly from pixels. Our approach uses a combination of a Temporal-difference (TD) loss and a reward prediction loss, which together capture the basic mathematical definition of SFs. We show that our approach matches or outperforms existing SF learning techniques in 2D (Minigrid) and 3D (Miniworld) mazes and in Mujoco, in both single-task and continual learning scenarios. Our technique is also efficient and can reach higher levels of performance in less time than other approaches. Our work provides a new, streamlined technique for learning SFs directly from pixel observations, with no pretraining required.
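
To make the two losses named in the abstract concrete, the following is a minimal NumPy sketch under stated assumptions, not the authors' implementation. It assumes the standard SF relations $r_{t+1} \approx \phi(S_{t+1})^{\top} \boldsymbol{w}$ and $Q(s, a) = \psi(s, a)^{\top} \boldsymbol{w}$, a discrete action space, and a Q-learning-style bootstrapped target. The function names, array shapes, and the 0.5 scaling are hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def reward_prediction_loss(phi_next, w, reward):
    """Squared error between observed rewards and phi(S_{t+1})^T w.

    phi_next: (B, d) basis features of the next state
    w:        (d,)   task-encoding vector
    reward:   (B,)   observed rewards
    """
    predicted = phi_next @ w
    return 0.5 * np.mean((reward - predicted) ** 2)


def q_sf_td_loss(psi, psi_next, phi_next, w, action, gamma=0.99):
    """TD loss on Q-values computed as psi^T w (assumed form).

    psi:      (B, A, d) successor features at S_t, one row per action
    psi_next: (B, A, d) successor features at S_{t+1}
    action:   (B,)      integer index of the action taken at S_t
    """
    q = psi @ w            # (B, A) Q-values at S_t
    q_next = psi_next @ w  # (B, A) Q-values at S_{t+1}
    q_taken = q[np.arange(len(action)), action]
    # Assumption: one-step target built from the predicted reward plus the
    # greedy bootstrapped value; terminal states are not handled here.
    target = phi_next @ w + gamma * q_next.max(axis=1)
    return 0.5 * np.mean((target - q_taken) ** 2)
```

NumPy computes only the loss values. In an autodiff framework, the target would typically be held fixed when differentiating the TD loss, and which parameters each loss updates is a design choice that this sketch leaves open.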

Paper Structure

This paper contains 70 sections, 1 theorem, 21 equations, 55 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

Optimizing $\nabla_{\psi} L_{\psi} \simeq \boldsymbol{w}^{\top} \nabla_{\psi} L_{\text{SF}} \boldsymbol{w}$, where $L_{\text{SF}}$ is the canonical loss for universal successor features (Borsa et al., 2018).

Figures (55)

  • Figure 1: (a) Results from a single task within a 2D two-room environment, illustrating the suboptimal performance of the canonical Successor Features (SF) learning rule (Eq. \ref{eq:canonical_sf_td_loss}) due to representation collapse. (b) In the canonical SF approach, the average cosine similarity between pairs of SFs converges towards a value of 1, demonstrating that representation collapse occurs. (c) The canonical SF learning rule does not develop distinct clusters in its representations, as evidenced by lower silhouette scores and higher Davies-Bouldin scores, which again indicates representation collapse. A mathematical proof can be found in Section \ref{subsection:representation_collapse_proof}.
  • Figure 2: Our proposed model for learning SFs. Starting from the top, the representations of state $S_t$ are learned using the shared encoder, resulting in $h_t$. The basis features $\phi(S_{t+1})$ are the normalized output of the encoder using state $S_{t+1}$. The task-encoding vector $\boldsymbol{w}$ is learned through the reward prediction loss (Eq. \ref{eq:r_pr_loss}). Concatenated with $\boldsymbol{w}$, the basis features and successor features are learned through computing the Q-values with $\boldsymbol{w}$ and minimizing the Q-SF-TD loss function (Eq. \ref{eq:sf_td_loss}). Schematics for continuous actions and for previous approaches can be found in Appendix \ref{section:our_model_continous} and Appendix \ref{section:previous_models}, respectively.
  • Figure 3: Continual Reinforcement Learning evaluation with pixel observations in the 2D Minigrid and 3D Four Rooms environments. The replay buffer is reset at each task transition to simulate drastic distribution shifts. Agents face two sequential tasks (Task 1 & Task 2), each repeated twice (Exposure 1 & Exposure 2). (a-c): The total cumulative returns accumulated during training. Overall, our agent, Simple SF (orange), performs notably better and exhibits better transfer in later tasks than both DQN (blue) and agents with added constraints. Importantly, constraints like reconstruction and orthogonality on basis features can impede learning. The plots for moving average episode returns are available in Appendix \ref{subsection:moving_avg_reset_replay_plot} for additional insights.
  • Figure 4: Continual Reinforcement Learning results using pixel observations in the Mujoco environment across 5 random seeds. The replay buffer is reset at each task transition to simulate drastic distribution shifts. We started with the half-cheetah domain in Task 1, where agents were rewarded for running forward. We then introduced three different scenarios in Task 2: (a) agents were rewarded for running backwards, (b) running faster, and, in the most drastic change, (c) switching from the half-cheetah to the walker domain with a forward running task. To ensure comparability across these diverse scenarios, we normalized the returns, considering that each task has different maximum attainable returns per episode. We did not evaluate APS (Pre-train) here because it struggles in the Continual RL setting, even in simpler environments such as the 2D Minigrid and 3D Miniworld.
  • Figure 5: Decoding performance comparison of models' SFs into SRs using a non-linear decoder in the Center-Wall environment. Ground truth SRs are generated analytically using Eq. \ref{eq:analytical_sr}, described in Appendix \ref{section: Correlation analysis}. Lower Mean Squared Error values on the y-axis indicate better performance.
  • ...and 50 more figures
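
The collapse diagnostic described in the caption of Figure 1(b), the average cosine similarity between pairs of SFs, can be sketched as follows. This is an illustrative NumPy version, not the authors' evaluation code. The function name `mean_pairwise_cosine` and the choice to average over all off-diagonal pairs are assumptions.

```python
import numpy as np


def mean_pairwise_cosine(sfs, eps=1e-8):
    """Average cosine similarity over all distinct pairs of SF vectors.

    sfs: (N, d) array with one successor-feature vector per row (N >= 2).
    Values near 1 mean the vectors point in nearly the same direction,
    which is the signature of representation collapse.
    """
    sfs = np.asarray(sfs, dtype=float)
    unit = sfs / (np.linalg.norm(sfs, axis=1, keepdims=True) + eps)
    sim = unit @ unit.T                        # (N, N) cosine similarities
    n = sim.shape[0]
    off_diagonal = sim[~np.eye(n, dtype=bool)]  # drop self-similarities
    return float(off_diagonal.mean())
```

For example, rows that are all positive multiples of one vector give a value close to 1, while mutually orthogonal rows such as those of `np.eye(4)` give 0.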

Theorems & Definitions (1)

  • Proposition 1