What do we learn from a large-scale study of pre-trained visual representations in sim and real environments?

Sneha Silwal; Karmesh Yadav; Tingfan Wu; Jay Vakil; Arjun Majumdar; Sergio Arnaud; Claire Chen; Vincent-Pierre Berges; Dhruv Batra; Aravind Rajeswaran; Mrinal Kalakrishnan; Franziska Meier; Oleksandr Maksymets

TL;DR

This work investigates how pre-trained visual representations (PVRs) trained on broad, external data translate to real-world robotic control. By evaluating five PVRs across three robot platforms and two learning paradigms on five tasks, the study reveals strong sim-to-real predictivity after careful alignment, and demonstrates a first zero-shot ImageNav transfer in the real world when using PVRs with RL-trained policies. It also analyzes how model size, fine-tuning, and data augmentation affect real-world performance, finding that certain configurations (e.g., VC-1 Base with augmentation and fine-tuning) offer the best average results, while Sim2Real transfers remain task-dependent. Overall, the paper highlights the value of large-scale sim-based benchmarking for PVRs, the special role of Indoor ImageNav, and practical guidelines for leveraging PVRs in real-world robotics deployments.

Abstract

We present a large empirical investigation into the use of pre-trained visual representations (PVRs) for training downstream policies that execute real-world tasks. Our study covers five different PVRs, evaluated on five distinct manipulation and indoor navigation tasks, using three different robots and two different policy learning paradigms. From this effort, we arrive at three insights: 1) the performance trends of PVRs in simulation are generally indicative of their trends in the real world, 2) the use of PVRs enables a first-of-its-kind result with indoor ImageNav (zero-shot transfer to a held-out scene in the real world), and 3) the benefits from variations in PVRs, primarily data augmentation and fine-tuning, also transfer to real-world performance. See the project website for additional details and visuals.
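As an illustration of the first insight, one way to quantify how well simulation results track hardware results is a correlation coefficient computed over paired (model, task) success rates. The sketch below assumes a Pearson correlation; the abstract does not state which statistic the paper uses, and the function name `pearson` and every number in the example are hypothetical placeholders, not results from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalised because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder success rates, one entry per (model, task) pair.
# These are NOT values reported in the paper.
sim_success = [0.20, 0.35, 0.50, 0.65, 0.80]
real_success = [0.25, 0.30, 0.55, 0.60, 0.85]

score = pearson(sim_success, real_success)
print(f"sim-vs-real correlation: {score:.3f}")
```

A value near 1 would mean that ranking PVRs in simulation gives roughly the same ordering as ranking them on hardware, while a value near 0 would mean simulation says little about real-world performance.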

Paper Structure (20 sections, 1 equation, 11 figures, 11 tables)

INTRODUCTION
RELATED WORK
SIMULATED AND REAL-WORLD MANIPULATION AND NAVIGATION TASKS
Planar Cube Manipulation with a Trifinger Robot via Behavior Cloning
Manipulation Tasks with a Franka Robot via Behavior Cloning
Visual Navigation with a Stretch Robot via Large-Scale Reinforcement Learning
EXPERIMENTAL FINDINGS
Evaluating Pre-Trained Visual Representations (PVRs) in Simulation and Reality
Sim Predictivity of hardware results when policies are trained on real demonstrations
Effect of Sim2Real Policy Transfer on Simulation Predictivity
Impact of Model Size, Fine-Tuning, and Data Augmentation
CONCLUSIONS
Acknowledgements
APPENDIX
Details of Manipulation and Navigation tasks
...and 5 more sections

Figures (11)

Figure 1: We conducted 348 experiments with PVRs on five tasks (push cube, pick up bottle, open drawer, reach goal position, and image-goal navigation (ImageNav)), three robots (Trifinger, Franka, and Stretch), two learning paradigms (imitation and reinforcement learning), in sim and reality.
Figure 2: Top row: Our five tasks in the simulation setting. Bottom row: Corresponding tasks on hardware.
Figure 3: Comparison of Sim Predictivity between CortexBench (left) and our simulation setting (right). Each data point represents a (model, task) tuple. Models and tasks are depicted by colors and symbols respectively, as shown in the legend.
Figure 4: Sim Predictivity chart comparing correlations of sim performance to policies trained in the real world (blue) vs. correlations of sim performance to policies trained in sim and transferred to hardware (red). Sim2Real transfer (red) is poor across the board for tasks that use few-shot imitation learning, as seen by the red points at the bottom of the plot. Transfer performance is substantially better on ImageNav (red cross markers), which is trained using large-scale reinforcement learning on simulated scenes.
Figure 5: Sim Predictivity correlation plots analyzing the impact of different model variations of VC-1: (a, b) model size; (c, d) fine-tuning; (e, f) data augmentation.
...and 6 more figures
