Pretrained Visual Representations in Reinforcement Learning

Emlyn Williams, Athanasios Polydoros

TL;DR

This paper investigates whether pretrained visual representations (PVRs) can replace or augment CNNs trained from scratch in visual reinforcement learning. It benchmarks the Dormant Ratio Minimization (DRM) algorithm against three frozen PVR backbones, ResNet18, DINOv2, and Visual Cortex (VC), on Metaworld Push-v2 and Drawer-Open-v2. Whether PVRs come out ahead is task-dependent: CNN-based encoders perform best on Push-v2, while certain PVRs excel on Drawer-Open-v2. The study highlights practical benefits of PVRs, including a drastically reduced replay-buffer size and faster training, and identifies the dormant ratio as a strong indicator of learning progress and exploration quality. It also shows that ViT-based PVRs can underperform CNNs at standard resolutions, though higher resolutions partly close this gap. Overall, the findings inform when to favor PVRs over training from scratch and point to mechanisms for reducing the dormant ratio as a key research direction for visual RL.
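The replay-buffer saving mentioned above can be illustrated with a back-of-the-envelope calculation. The sketch below assumes the saving comes from storing a frozen encoder's embeddings in the buffer instead of raw frames; the summary does not state this mechanism, and the resolution, frame stack, and embedding width used here are hypothetical values chosen only for illustration.

```python
def image_obs_bytes(height: int, width: int, channels: int,
                    frame_stack: int, bytes_per_pixel: int = 1) -> int:
    """Bytes needed to store one stacked raw-image observation (uint8 by default)."""
    return height * width * channels * frame_stack * bytes_per_pixel


def embedding_obs_bytes(embedding_dim: int, frame_stack: int,
                        bytes_per_value: int = 4) -> int:
    """Bytes needed to store one stacked embedding observation (float32 by default)."""
    return embedding_dim * frame_stack * bytes_per_value


# Hypothetical settings, not taken from the paper:
# 84x84 RGB frames, 3 stacked frames, 512-dimensional embeddings.
raw = image_obs_bytes(height=84, width=84, channels=3, frame_stack=3)
emb = embedding_obs_bytes(embedding_dim=512, frame_stack=3)
print(f"raw frames: {raw} B, embeddings: {emb} B, ratio: {raw / emb:.1f}x")
```

Under these assumed settings the per-observation footprint shrinks by roughly an order of magnitude. The actual ratio depends on the image resolution and the encoder's output width.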

Abstract

Visual reinforcement learning (RL) has made significant progress in recent years, but the choice of visual feature extractor remains a crucial design decision. This paper compares the performance of RL algorithms that train a convolutional neural network (CNN) from scratch with those that use pretrained visual representations (PVRs). We evaluate the Dormant Ratio Minimization (DRM) algorithm, a state-of-the-art visual RL method, against three PVRs: ResNet18, DINOv2, and Visual Cortex (VC). We use the Metaworld Push-v2 and Drawer-Open-v2 tasks for our comparison. Our results show that whether training from scratch or using PVRs yields the best performance is task-dependent, but PVRs offer advantages in terms of reduced replay buffer size and faster training times. We also identify a strong correlation between the dormant ratio and model performance, highlighting the importance of exploration in visual RL. Our study provides insights into the trade-offs between training from scratch and using PVRs, informing the design of future visual RL algorithms.
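The abstract relies on the dormant ratio without defining it. As a minimal sketch, the code below follows the definition commonly used in the dormant-neuron literature, which is assumed here to match the paper's usage: a neuron's score is its mean absolute activation over a batch divided by the average of that quantity across its layer, a neuron is dormant when its score is at most a threshold tau, and the dormant ratio is the fraction of dormant neurons across all layers. The function names and the threshold value in the example are hypothetical.

```python
from typing import List, Sequence

# batch[x][i] is the activation of neuron i on input x, for a single layer.
Batch = Sequence[Sequence[float]]


def layer_scores(batch: Batch) -> List[float]:
    """Normalized activation score of each neuron in one layer."""
    n_inputs = len(batch)
    n_neurons = len(batch[0])
    # Mean absolute activation of each neuron over the batch.
    mean_abs = [sum(abs(row[i]) for row in batch) / n_inputs
                for i in range(n_neurons)]
    layer_mean = sum(mean_abs) / n_neurons
    if layer_mean == 0.0:
        # A fully silent layer: every neuron gets score 0.
        return [0.0] * n_neurons
    return [m / layer_mean for m in mean_abs]


def dormant_ratio(layers: Sequence[Batch], tau: float) -> float:
    """Fraction of neurons, over all layers, whose score is <= tau."""
    dormant = 0
    total = 0
    for batch in layers:
        scores = layer_scores(batch)
        dormant += sum(1 for s in scores if s <= tau)
        total += len(scores)
    return dormant / total


# Toy activations: layer 1 has two silent neurons out of four,
# layer 2 has two equally active neurons.
layer1 = [[1.0, 0.0, 2.0, 0.0],
          [3.0, 0.0, 2.0, 0.0]]
layer2 = [[0.5, 0.5],
          [0.5, 0.5]]
ratio = dormant_ratio([layer1, layer2], tau=0.1)  # tau is illustrative only
print(ratio)
```

In this toy example two of the six neurons are dormant, so the ratio is one third. In a real agent the activations would come from the network's hidden layers on a batch of observations.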

Paper Structure (19 sections, 7 figures, 1 table)

Figures (7)

  • Figure 1: Metaworld Tasks
  • Figure 2: Comparison of using plain images to using random shift augmentations.
  • Figure 3: Mean performance on Push-v2 Metaworld task.
  • Figure 4: Mean performance on Push-v2 Metaworld task with increased resolutions for ViT PVRs.
  • Figure 5: Per Seed Performance of DINOv2 CLS token at 224x224 resolution
  • ...and 2 more figures