ViSaRL: Visual Reinforcement Learning Guided by Human Saliency

Anthony Liang; Jesse Thomason; Erdem Bıyık

TL;DR

Visual representations learned using ViSaRL are robust to various sources of visual perturbation, including perceptual noise and scene variations, and ViSaRL nearly doubles the success rate on real-robot tasks compared to a baseline that does not use saliency.

Abstract

Training robots to perform complex control tasks from high-dimensional pixel input using reinforcement learning (RL) is sample-inefficient, because image observations consist primarily of task-irrelevant information. By contrast, humans are able to visually attend to task-relevant objects and areas. Based on this insight, we introduce Visual Saliency-Guided Reinforcement Learning (ViSaRL). Using ViSaRL to learn visual representations significantly improves the success rate, sample efficiency, and generalization of an RL agent on diverse tasks, including the DeepMind Control benchmark and robot manipulation both in simulation and on a real robot. We present approaches for incorporating saliency into both CNN and Transformer-based encoders. We show that visual representations learned using ViSaRL are robust to various sources of visual perturbation, including perceptual noise and scene variations. ViSaRL nearly doubles the success rate on the real-robot tasks compared to a baseline that does not use saliency.

Paper Structure (9 sections, 11 figures, 8 tables, 1 algorithm)

INTRODUCTION
RELATED WORK
VISUAL SALIENCY-GUIDED RL
EXPERIMENT SETUP
SIMULATION EXPERIMENTS
CNN Encoder
MultiMAE Transformer
REAL ROBOT EXPERIMENTS
CONCLUSION

Figures (11)

Figure 1: ViSaRL trains a saliency prediction model from a few human-annotated saliency maps. This model is used to augment an offline image dataset with saliency. A visual encoder is pretrained with the dataset and used during downstream policy learning to generate latent representations of the agent's observations.
Figure 2: Annotation Interface. Custom click-based saliency annotation interface. Each click generates a Gaussian centered at the clicked coordinate with some variance. Warmer colors denote more salient regions, such as the drawer handle and the robot's end-effector.
Figure 3: ViSaRL. We pretrain a MultiMAE (Bachmann et al., 2022) Transformer on a dataset of paired images and saliency maps. MultiMAE employs a self-supervised objective in which masked patches for both input modalities are reconstructed given only the visible patches. The pretrained model is frozen and used for extracting representations during task learning. There is no input masking during downstream RL.
Figure 4: Learning curves for four robot manipulation tasks in Meta-World evaluated by task success rate. (Top) CNN encoder methods. (Bottom) Transformer encoder methods. We select tasks that require manipulating small objects with different motions such as pushing, pulling, and reaching. The solid lines represent the mean and the shaded regions the standard error across three seeds.
Figure 5: Evaluation Tasks. Four Meta-World simulation tasks (top) and four real-robot tabletop manipulation tasks (bottom).
...and 6 more figures
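
The click-based annotation described in the Figure 2 caption (each click produces a Gaussian centered at the clicked coordinate) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `clicks_to_saliency`, the default `sigma`, the 64x64 map size, and the choice to merge overlapping clicks with a pixel-wise maximum are all assumptions made for illustration.

```python
import math


def clicks_to_saliency(clicks, height, width, sigma=8.0):
    """Render (row, col) clicks as a saliency map with values in [0, 1].

    Hypothetical helper: sigma (in pixels) and the max-merge rule are
    assumptions, not values taken from the paper.
    """
    saliency = [[0.0] * width for _ in range(height)]
    for click_row, click_col in clicks:
        for r in range(height):
            for c in range(width):
                # Unnormalized isotropic Gaussian, peaking at 1.0 on the click.
                sq_dist = (r - click_row) ** 2 + (c - click_col) ** 2
                blob = math.exp(-sq_dist / (2.0 * sigma ** 2))
                # Assumed merge rule: keep the strongest response per pixel.
                saliency[r][c] = max(saliency[r][c], blob)
    return saliency


# Two example clicks on a 64x64 frame (coordinates chosen arbitrarily).
saliency_map = clicks_to_saliency([(20, 30), (50, 10)], height=64, width=64)
```

Taking the maximum keeps the map bounded by 1 regardless of how many clicks overlap; summing and renormalizing would be an equally plausible choice if repeated clicks are meant to raise saliency.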
