Pretrained Visual Representations in Reinforcement Learning
Emlyn Williams, Athanasios Polydoros
TL;DR
This paper investigates whether pretrained visual representations (PVRs) can replace or augment CNNs trained from scratch in visual reinforcement learning. It benchmarks the Dormant Ratio Minimization (DRM) algorithm against three frozen PVR backbones (ResNet18, DINOv2, and Visual Cortex (VC)) on Metaworld Push-v2 and Drawer-Open-v2. Which approach performs best is task-dependent: CNN-based encoders perform best on Push-v2, while certain PVRs excel on Drawer-Open-v2. The study highlights practical benefits of PVRs, including a drastically reduced replay-buffer size and faster training, and identifies the dormant ratio as a strong indicator of learning progress and exploration quality. It also shows that ViT-based PVRs can underperform CNNs at standard resolutions, although higher resolutions partly close this gap. Overall, the findings inform when to favor PVRs over training from scratch and point to mechanisms for reducing the dormant ratio as a key research direction for visual RL.
Abstract
Visual reinforcement learning (RL) has made significant progress in recent years, but the choice of visual feature extractor remains a crucial design decision. This paper compares RL algorithms that train a convolutional neural network (CNN) from scratch with those that use pretrained visual representations (PVRs). We evaluate the Dormant Ratio Minimization (DRM) algorithm, a state-of-the-art visual RL method, against three PVRs: ResNet18, DINOv2, and Visual Cortex (VC). We use the Metaworld Push-v2 and Drawer-Open-v2 tasks for our comparison. Our results show that whether training from scratch or using PVRs yields the best performance is task-dependent, but that PVRs offer advantages in terms of reduced replay buffer size and faster training times. We also identify a strong correlation between the dormant ratio and model performance, highlighting the importance of exploration in visual RL. Our study provides insights into the trade-offs between training from scratch and using PVRs, informing the design of future visual RL algorithms.
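To make the dormant ratio concrete, the sketch below assumes the definition commonly used in the dormant-neuron literature: a neuron counts as dormant when its mean absolute activation, divided by the average of that quantity over its layer, is at or below a threshold. The function name `dormant_ratio`, the threshold `tau`, and the toy activations are illustrative assumptions and are not taken from this paper's implementation.

```python
import numpy as np


def dormant_ratio(layer_activations, tau=0.025):
    """Fraction of neurons whose layer-normalized mean |activation| is <= tau.

    layer_activations: list of arrays, one per layer, each of shape
    (batch, num_neurons), holding post-activation outputs.
    tau: dormancy threshold (hypothetical default, chosen for illustration).
    """
    dormant, total = 0, 0
    for acts in layer_activations:
        mean_abs = np.abs(acts).mean(axis=0)        # per-neuron average magnitude
        layer_mean = mean_abs.mean()                # average over the layer
        scores = mean_abs / (layer_mean + 1e-9)     # normalized activation score
        dormant += int((scores <= tau).sum())
        total += scores.size
    return dormant / total


# Toy layer with four neurons over a batch of two inputs.
# Neurons 0 and 3 never fire, so two of four neurons are dormant.
acts = np.array([[0.0, 1.0, 2.0, 0.0],
                 [0.0, 3.0, 2.0, 0.0]])
ratio = dormant_ratio([acts])
print(ratio)  # 0.5
```

Under this assumed definition, a ratio near zero means almost all neurons contribute to the output, while a high ratio suggests that much of the network has become inactive, which is the condition the abstract associates with poorer performance.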
