Image-Based Deep Reinforcement Learning with Intrinsically Motivated Stimuli: On the Execution of Complex Robotic Tasks

David Valencia; Henry Williams; Yuning Xing; Trevor Gee; Minas Liarokapis; Bruce A. MacDonald

TL;DR

This paper tackles sparse, poorly defined rewards in reinforcement learning by introducing intrinsic motivation signals based on novelty and surprise. It proposes NaSA-TD3, an image-based extension of TD3 that uses an autoencoder to learn latent representations from pixels and incorporates two distinct intrinsic rewards: novelty (via SSIM-based reconstruction familiarity) and surprise (via latent-space prediction error from an ensemble). The method updates the encoder and policy jointly, enabling end-to-end learning from images and achieving state-of-the-art performance on several simulated robotic tasks and a real-world dexterous manipulation task without pretraining or demonstrations. The results demonstrate substantial improvements in exploration and final performance, with practical scalability and applicability to camera-based robotics, albeit with notable RAM and compute requirements.
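The novelty signal described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes novelty is defined as `1 - SSIM` between an observation and its autoencoder reconstruction, and it uses a simplified single-window (global) SSIM rather than the usual sliding-window version. The function names `global_ssim` and `novelty_reward` are hypothetical.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM computed over the whole image.

    Simplification: standard SSIM averages this statistic over local
    windows; here one window covers the entire image.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Standard SSIM stabilising constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def novelty_reward(observation, reconstruction):
    """Assumed form: a poorly reconstructed (unfamiliar) image scores high."""
    return 1.0 - global_ssim(observation, reconstruction)
```

Under this assumed definition, an observation the autoencoder reconstructs perfectly yields a novelty of approximately zero, while a reconstruction that loses the image structure pushes the value toward one.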

Abstract

Reinforcement Learning (RL) has been widely used to solve tasks where the environment consistently provides a dense reward value. However, in real-world scenarios, rewards can often be poorly defined or sparse. Auxiliary signals are indispensable for discovering efficient exploration strategies and aiding the learning process. In this work, inspired by intrinsic motivation theory, we postulate that the intrinsic stimuli of novelty and surprise can assist in improving exploration in complex, sparsely rewarded environments. We introduce NaSA-TD3, a novel sample-efficient method able to learn directly from pixels: an image-based extension of TD3 with an autoencoder. The experiments demonstrate that NaSA-TD3 is easy to train and an efficient method for tackling complex continuous-control robotic tasks, both in simulated environments and real-world settings. NaSA-TD3 outperforms existing state-of-the-art image-based RL methods in terms of final performance without requiring pre-trained models or human demonstrations.

Paper Structure (13 sections, 3 equations, 8 figures)

Introduction
Definitions
What is Novelty?
What is Surprise?
Related work
NaSA-TD3
Novelty and Surprise Detection
Image-based policy learning
Simulated Environments
Experimental Results
Dexterous Robotics Manipulation in the Real World
Results and Discussion
Conclusion

Figures (8)

Figure 1: Novelty detection diagram. At each time step, an observation is passed to the encoder. The decoder receives the latent representation $z$ and reconstructs the original observation. SSIM is calculated between the reconstruction and the original observation.
Figure 2: Architecture of the ensemble of predictive models. Each model predicts the next latent representation $z_{t+1}$, and the mean of the predictions is then calculated.
Figure 3: The proposed NaSA-TD3 architecture. The encoder network consists of four convolutional layers, each with $32$ filters, a kernel size of $3\times3$, and $ReLU$ as the activation function. The output of the convolutional layers is flattened and routed to a fully connected layer and a normalization layer with a $Tanh$ activation function. The decoder network is a deconvolutional mirror of the encoder with $Sigmoid$ as the final activation function. The TD3 network consists of an actor network and two critic networks. All three networks have two hidden fully connected layers with $1024$ nodes each and $ReLU$ activations. The actor has $Tanh$ as the activation function for the output layer. The predictive ensemble model has two hidden layers with $512$ nodes and $ReLU$, and the output layer has the size of the latent $z$ vector.
Figure 4: Analysis of the optimal latent $z$ size on the ball in the cup task. We ran trial-and-error experiments under identical conditions, varying only the latent size to find the best value.
Figure 5: Image-based control tasks used in our experimentation. The ball in the cup and reacher tasks have a sparse reward that is only given once the ball is caught or the finger touches the red sphere, respectively. Cartpole and walker tasks require balance and constant movement. The finger spin task includes contact between the finger and the object, while the cheetah run task demands coordination and motion of a significant number of joints.
...and 3 more figures
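The ensemble mechanism in the Figure 2 caption (each model predicts $z_{t+1}$, then the predictions are averaged) can be sketched as follows. This is a hedged illustration only: the captions do not state the model inputs or the error metric, so the sketch assumes each model takes the concatenated current latent and action, and that surprise is the mean squared error between the averaged prediction and the latent actually observed next. The names `surprise_reward` and `make_linear_model` are hypothetical, and the single-layer models stand in for the two-hidden-layer networks described in Figure 3.

```python
import numpy as np


def surprise_reward(models, z_t, action, z_next):
    """Mean squared error between the ensemble-mean prediction and z_next.

    Assumptions (not stated in the captions): models consume [z_t, action],
    and the prediction error is measured with MSE.
    """
    model_input = np.concatenate([z_t, action])
    # Shape (n_models, latent_size): one predicted next latent per model.
    predictions = np.stack([model(model_input) for model in models])
    mean_prediction = predictions.mean(axis=0)
    return float(np.mean((mean_prediction - z_next) ** 2))


def make_linear_model(rng, in_dim, out_dim):
    """Hypothetical stand-in for one trained predictive network."""
    weights = rng.normal(scale=0.1, size=(out_dim, in_dim))
    # Tanh keeps outputs in the same range as the Tanh-bounded latent.
    return lambda model_input: np.tanh(weights @ model_input)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent_size, action_size = 8, 2
    ensemble = [
        make_linear_model(rng, latent_size + action_size, latent_size)
        for _ in range(5)
    ]
    z_t = rng.uniform(-1, 1, latent_size)
    action = rng.uniform(-1, 1, action_size)
    z_next = rng.uniform(-1, 1, latent_size)
    print(surprise_reward(ensemble, z_t, action, z_next))
```

Averaging over several models before measuring the error is one way to reduce the influence of any single model's idiosyncratic mistakes, so that a large error more plausibly reflects a genuinely unexpected transition.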
