What do we learn from a large-scale study of pre-trained visual representations in sim and real environments?
Sneha Silwal, Karmesh Yadav, Tingfan Wu, Jay Vakil, Arjun Majumdar, Sergio Arnaud, Claire Chen, Vincent-Pierre Berges, Dhruv Batra, Aravind Rajeswaran, Mrinal Kalakrishnan, Franziska Meier, Oleksandr Maksymets
TL;DR
This work investigates how pre-trained visual representations (PVRs), trained on broad external data, transfer to real-world robotic control. Evaluating five PVRs on five tasks across three robot platforms and two policy-learning paradigms, the study finds that simulation results strongly predict real-world results after careful sim-real alignment, and it demonstrates a first-of-its-kind zero-shot ImageNav transfer to the real world using PVRs with RL-trained policies. It also analyzes how model size, fine-tuning, and data augmentation affect real-world performance, finding that certain configurations (e.g., VC-1 Base with augmentation and fine-tuning) give the best average results, while sim-to-real transfer remains task-dependent. Overall, the paper highlights the value of large-scale simulation benchmarking for PVRs, the special role of indoor ImageNav, and practical guidelines for using PVRs in real-world robot deployments.
Abstract
We present a large empirical investigation into the use of pre-trained visual representations (PVRs) for training downstream policies that execute real-world tasks. Our study covers five different PVRs, each evaluated on five distinct manipulation and indoor navigation tasks, using three different robots and two different policy-learning paradigms. From this effort, we draw three insights: 1) the performance trends of PVRs in simulation are generally indicative of their trends in the real world, 2) the use of PVRs enables a first-of-its-kind result with indoor ImageNav (zero-shot transfer to a held-out scene in the real world), and 3) the benefits from variations in PVRs, primarily data augmentation and fine-tuning, also transfer to real-world performance. See the project website for additional details and visuals.
