Momentum Boosted Episodic Memory for Improving Learning in Long-Tailed RL Environments

Dolton Fernandes, Pramod Kaushik, Harsh Shukla, Bapi Raju Surampudi

TL;DR

The paper targets learning under Zipfian, long-tail data distributions in reinforcement learning by integrating a fast/slow learning paradigm. It introduces a modular Momentum Boosted Episodic Memory (MEM) architecture that uses a familiarity buffer and momentum-based contrastive learning to identify and prioritize rare trajectories, then reinstates their hidden activations through an episodic memory module to improve policy decisions. The approach, which augments IMPALA with a contrastive learning branch and a memory retrieval mechanism, delivers superior performance across Zipfian tasks and Atari benchmarks, outperforming strong baselines and several ablations. This method offers a practical, architecture-agnostic pathway to enhance sample efficiency and long-term credit assignment in non-uniform environments, with potential applicability to more realistic 3D settings and real-world decision-making problems.

Abstract

Traditional Reinforcement Learning (RL) algorithms assume the distribution of the data to be uniform or mostly uniform. However, this is not the case in most real-world applications, such as autonomous driving, or in nature, where animals roam. Some experiences are encountered frequently, and most of the remaining experiences occur rarely; the resulting distribution is called Zipfian. Taking inspiration from the theory of complementary learning systems, an architecture for learning from Zipfian distributions is proposed in which important long-tail trajectories are discovered in an unsupervised manner. The proposal comprises an episodic memory buffer containing a prioritised memory module that keeps important rare trajectories longer, addressing the Zipfian problem, which requires credit assignment to happen in a sample-efficient manner. The experiences are then reinstated from episodic memory and given weighted importance, forming the trajectory to be executed. Notably, the proposed architecture is modular, can be incorporated into any RL architecture, and yields improved performance over traditional architectures in multiple Zipfian tasks. Our method outperforms IMPALA by a significant margin on all three tasks and all three evaluation metrics (Zipfian, Uniform, and Rare Accuracy) and also gives improvements on most Atari environments that are considered challenging.
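
To make the notion of a Zipfian distribution concrete, the following is a minimal sketch that assumes the textbook form of Zipf's law, in which the probability of the experience at rank k is proportional to 1 / k^s. The exponent `s`, the number of experience types, and the helper names `zipf_probabilities` and `sample_counts` are illustrative assumptions and are not taken from the paper, whose exact parameterization is not reproduced here.

```python
import random


def zipf_probabilities(n, s=1.0):
    """Normalized probabilities for ranks 1..n, with p_k proportional to 1 / k**s.

    Assumption: this is the standard form of Zipf's law, not necessarily
    the exact parameterization used in the paper.
    """
    weights = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_counts(probs, num_trials, seed=0):
    """Draw experience types according to probs and count how often each occurs."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(probs)), weights=probs, k=num_trials)
    counts = [0] * len(probs)
    for idx in draws:
        counts[idx] += 1
    return counts


if __name__ == "__main__":
    probs = zipf_probabilities(7)  # 7 experience types, chosen for illustration
    counts = sample_counts(probs, num_trials=10_000)
    for rank, (p, c) in enumerate(zip(probs, counts), start=1):
        print(f"rank {rank}: p={p:.3f}, sampled {c} times")
```

With `s = 1`, the most common experience is twice as likely as the second and seven times as likely as the seventh, so an agent trained on uniformly weighted samples sees the tail only rarely.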

Paper Structure

This paper contains 16 sections, 8 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: Zipf's 3DWorld Task: Contains 7 maps, each with 5 objects placed at random locations. The location of these objects does not change during trials. The agent (Red triangle in top view) starts at a fixed location in each trial and has to navigate towards the target object, whose color is shown in the top-left corner along with the current map ID (0 indexed). The agent's first-person view of each map is shown in the bottom images. The details of the environment experienced can be seen in the annotated image. The value 'p' below each map shows the probability of occurrence of the map in a trial, highlighting the skew in the distribution. A similar skew occurs for the distribution of objects in these maps. We can see this in Figure 2, which shows the distribution of objects for the first map (most common).
  • Figure 2: The probability distribution for objects to appear as the target object in a map during a trial. This example shows the distribution of objects for Map 1 in Figure 1.
  • Figure 3: Image augmentations for contrastive learning: (a) Shows downsampled input image for a trial. (b) Input image after adding Gaussian noise to it. (c) Input image after applying random cutout augmentation. The black rectangle near the agent's position is the area cutout. (d) Final augmented image after adding Gaussian noise and random cutout.
  • Figure 4: Model Architecture: The figure shows our momentum-boosted episodic memory architecture pipeline. The IMPALA backbone consists of a CNN feature extractor followed by a Feed Forward layer that gives the embedding. This embedding is concatenated with the one-hot action encoding, reward & memory to get pixel embedding $p_i$ and then given to the LSTM network for further processing with working memory. The LSTM network additionally takes the past hidden state $h_{t-1}$ as input. During training, the input image, pixel embedding, LSTM hidden states and keys are stored in the familiarity buffer. The momentum loss tracked on this buffer during contrastive learning is then used to prioritize long-tail states. The MEM is then periodically updated with the top $t_f$ states from the familiarity buffer. The memory ($m_t$) is computed from the MEM using a weighted sum ($\bigoplus$) over results from a KNN similarity search on the keys present in the MEM using the query key $k_t$ (Equation \ref{eqn:wsum}).
  • Figure 5: Performance plots (Zipf's Gridworld): (a) Performance of IMPALA agent on each map and object. The y-axis denotes the map axis, and the x-axis denotes the object axis. Value at (i, j) shows the performance (0-1 scale) of the agent on the trial where the object with ID j is chosen at the map with ID i. An increase in i and j means an increase in the rareness of the map and object respectively according to Zipf's distribution (Equation \ref{eqn:zipfslaw}). (b) Performance of IMPALA with MEM added. We can see there are some medium-rare trials in which the agent has learned to navigate and solve the task. (c) IMPALA with Visual Reconstruction using CNN-based autoencoder. (d) Performance of IMPALA+MEM with contrastive learning. (e) Performance of our agent consisting of familiarity buffer that highlights long-tail samples for MEM using modified boosted contrastive learning.
  • ...and 3 more figures
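
The memory read described in the Figure 4 caption (a weighted sum over KNN results for the query key $k_t$) can be sketched roughly as follows. This is only an illustration under stated assumptions: the inverse-squared-distance weighting, the `eps` smoothing term, the value of `k`, and the function name `knn_memory_read` are hypothetical choices, and the paper's own weighted-sum equation is not reproduced here and may differ.

```python
def knn_memory_read(query_key, keys, values, k=3, eps=1e-3):
    """Return a memory vector as a weighted sum of the values of the k nearest keys.

    Assumptions (not from the paper): squared Euclidean distance for the
    similarity search, and weights proportional to 1 / (distance + eps),
    normalized to sum to 1.
    """
    if not keys:
        raise ValueError("episodic memory is empty")

    # Squared Euclidean distance from the query key to every stored key.
    dists = [
        sum((q - x) ** 2 for q, x in zip(query_key, key))
        for key in keys
    ]

    # Indices of the k closest keys (fewer if the memory holds fewer than k entries).
    nearest = sorted(range(len(keys)), key=lambda i: dists[i])[:k]

    # Inverse-distance weights, normalized so they form a convex combination.
    raw = [1.0 / (dists[i] + eps) for i in nearest]
    total = sum(raw)
    weights = [w / total for w in raw]

    # Weighted sum over the stored values (e.g. stored LSTM hidden states).
    dim = len(values[0])
    memory = [0.0] * dim
    for i, w in zip(nearest, weights):
        for d in range(dim):
            memory[d] += w * values[i][d]
    return memory


if __name__ == "__main__":
    keys = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    print(knn_memory_read([0.0, 0.0], keys, values, k=2))
```

Because the weights are normalized, the returned vector stays within the range spanned by the retrieved values, and a stored key that matches the query almost exactly dominates the read.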