Visual Episodic Memory-based Exploration

Jack Vice; Natalie Ruiz-Sanchez; Pamela K. Douglas; Gita Sukthankar

TL;DR

The paper addresses exploration in robotics under sparse extrinsic rewards by introducing visual episodic memory as an intrinsic motivation signal. It proposes a twin ConvLSTM autoencoder architecture that reconstructs ten-frame video sequences, using multi-frame SSIM as the intrinsic reward to guide exploration toward poorly predicted, dynamic spatiotemporal regions. Empirical results show superior performance to CVAE-based curiosity in detecting dynamic anomalies and reconstructing real-world video, while identifying catastrophic forgetting when learning proceeds during exploration. The work advances autonomous exploration for tasks like search and rescue and security by leveraging temporal-spatial visual memory to drive curiosity-driven behavior with practical robustness considerations.

Abstract

In humans, intrinsic motivation is an important mechanism for open-ended cognitive development; in robots, it has been shown to be valuable for exploration. An important aspect of human cognitive development is $\textit{episodic memory}$, which enables both the recollection of past events and the projection of a subjective future. This paper explores the use of visual episodic memory as a source of intrinsic motivation for robotic exploration problems. Using a convolutional recurrent neural network autoencoder, the agent learns an efficient representation of spatiotemporal features, such that accurate sequence prediction is possible only once those features have been learned. Structural similarity between ground-truth and autoencoder-generated images is used as an intrinsic motivation signal to guide exploration. Our proposed episodic memory model also implicitly accounts for the agent's actions, motivating the robot to seek new interactive experiences rather than just areas that are visually dissimilar. When guiding robotic exploration, our proposed method outperforms the Curiosity-driven Variational Autoencoder (CVAE) at finding dynamic anomalies.
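As a rough illustration of how structural similarity could act as an intrinsic motivation signal, the sketch below scores a sequence of reconstructed frames against ground truth and turns low similarity into high reward. It rests on several assumptions that are not taken from the paper: frames are flattened lists of intensities in [0, 1], SSIM is computed over a single global window rather than the usual sliding window, and the reward is defined as one minus the mean per-frame SSIM. The names `global_ssim` and `intrinsic_reward` are hypothetical.

```python
from statistics import fmean


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM between two flattened frames of equal size.

    Simplification (assumption): one global window instead of the
    sliding local windows used by standard SSIM implementations.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("frames must be non-empty and the same size")
    # Standard SSIM stabilising constants.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean([(a - mu_x) * (a - mu_x) for a in x])
    var_y = fmean([(b - mu_y) * (b - mu_y) for b in y])
    cov = fmean([(a - mu_x) * (b - mu_y) for a, b in zip(x, y)])
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


def intrinsic_reward(truth_frames, recon_frames):
    """Hypothetical reward: 1 - mean SSIM over the frame sequence.

    Poorly reconstructed sequences (low SSIM) yield a larger reward.
    """
    if len(truth_frames) != len(recon_frames) or not truth_frames:
        raise ValueError("sequences must be non-empty and the same length")
    scores = [global_ssim(t, r) for t, r in zip(truth_frames, recon_frames)]
    return 1.0 - fmean(scores)


# Ten tiny 4-pixel "frames": a well-reconstructed and a badly reconstructed case.
truth = [[0.0, 1.0, 0.0, 1.0] for _ in range(10)]
good_recon = [list(frame) for frame in truth]
bad_recon = [[1.0 - p for p in frame] for frame in truth]

familiar = intrinsic_reward(truth, good_recon)
novel = intrinsic_reward(truth, bad_recon)
```

Under these assumptions a perfectly reconstructed sequence earns a reward near zero, while a badly reconstructed one earns a larger reward, which is the direction an exploration policy would follow.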

Paper Structure (6 sections, 1 equation, 11 figures)

This paper contains 6 sections, 1 equation, 11 figures.

Introduction
Related Work
Method
Experimental Setup
Results
Conclusion and Future Work

Figures (11)

Figure 1: Our architecture consists of a simulation environment, twin convolutional LSTM autoencoders and a frontier exploration based navigation stack. The twin models run asynchronously with weights copied from the training model to the inference only model, enabling faster predictions for the mobile robot.
Figure 2: The visual episodic memory consists of a convolutional LSTM autoencoder. The autoencoder processes ten video frames simultaneously and attempts to reconstruct the input frames by learning the spatiotemporal patterns of the environment. The autoencoder bottleneck forces learning of a dense representation and prevents overfitting.
Figure 3: During training the model learns to reconstruct all the sequence frames. Starting in the upper left, five predicted frames of a dynamic scene are shown spanning 1800 epochs. The ground truth frame is shown in the lower right.
Figure 4: The number of anomaly rooms (blue) vs. non-anomaly rooms (orange) explored by the robot, summed over ten trials. The Curiosity-driven Variational Autoencoder (CVAE) [han2020curiosity] was tested with equivalent training and inference conditions. As expected, the frontier exploration method [yamauchi1997frontier] is insensitive to visual anomalies and explores rooms in equal proportion. The difference between our proposed LSTM Inference technique and the comparison methods (Frontier and VAE) is statistically significant ($p<0.05$).
Figure 5: The frame reconstruction error for the four types of anomaly rooms. The high error spikes for moving anomalies indicate the difficulty of predicting unlearned dynamics.
...and 6 more figures
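The twin-model arrangement described in the Figure 1 caption, where weights are copied from a training model to an inference-only model, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: weights are represented as a plain dictionary, the optimizer step is a caller-supplied function, and the class name `TwinModels` and its methods are hypothetical.

```python
import copy
import threading


class TwinModels:
    """Hypothetical holder for a training copy and an inference-only copy of the weights."""

    def __init__(self, initial_weights):
        self._train_weights = copy.deepcopy(initial_weights)
        self._infer_weights = copy.deepcopy(initial_weights)
        self._lock = threading.Lock()

    def train_step(self, update):
        # `update` stands in for an optimizer step: it maps weights to new weights.
        self._train_weights = update(self._train_weights)

    def sync(self):
        # Snapshot outside the lock so inference is blocked only for the swap.
        snapshot = copy.deepcopy(self._train_weights)
        with self._lock:
            self._infer_weights = snapshot

    def inference_weights(self):
        # The inference side only ever sees the last synced snapshot.
        with self._lock:
            return self._infer_weights


twin = TwinModels({"w": 0.0})
twin.train_step(lambda weights: {"w": weights["w"] + 1.0})
before_sync = twin.inference_weights()["w"]
twin.sync()
after_sync = twin.inference_weights()["w"]
```

Keeping the inference copy separate means predictions never read weights that are partway through an update, at the cost of the inference model lagging behind training until the next sync.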
