PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies
Arhan Jain, Mingtong Zhang, Kanav Arora, William Chen, Marcel Torne, Muhammad Zubair Irshad, Sergey Zakharov, Yue Wang, Sergey Levine, Chelsea Finn, Wei-Chiu Ma, Dhruv Shah, Abhishek Gupta, Karl Pertsch
TL;DR
PolaRiS is a scalable real-to-sim evaluation framework that converts short real-world video scans into high-fidelity simulated environments using neural scene reconstruction (2D Gaussian Splatting, 2DGS) and articulated Gaussian splats. A lightweight co-training procedure aligns simulated visuals with real-world perception, enabling zero-shot evaluation of generalist Vision-Language-Action policies in unseen environments. Empirical results show strong real-to-sim correlation (Pearson r around 0.9) and favorable alignment with RoboArena, outperforming traditional simulators and video-model baselines. The approach enables rapid creation of diverse, realistic evaluation scenes and aims to democratize large-scale robotic policy benchmarking.
Abstract
A significant challenge for robot learning research is our ability to accurately measure and compare the performance of robot policies. Benchmarking in robotics is historically challenging due to the stochasticity, limited reproducibility, and time-consuming nature of real-world rollouts. This challenge is exacerbated for recent generalist policies, which have to be evaluated across a wide variety of scenes and tasks. Evaluation in simulation offers a scalable complement to real-world evaluations, but the visual and physical domain gap between existing simulation benchmarks and the real world has made them an unreliable signal for policy improvement. Furthermore, building realistic and diverse simulated environments has traditionally required significant human effort and expertise. To bridge this gap, we introduce Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS), a scalable real-to-sim framework for high-fidelity simulated robot evaluation. PolaRiS utilizes neural reconstruction methods to turn short video scans of real-world scenes into interactive simulation environments. Additionally, we develop a simple simulation data co-training recipe that bridges remaining real-to-sim gaps and enables zero-shot evaluation in unseen simulation environments. Through extensive paired evaluations between simulation and the real world, we demonstrate that PolaRiS evaluations correlate much more strongly with real-world generalist policy performance than existing simulated benchmarks do. Its simplicity also enables rapid creation of diverse simulated environments. As such, this work takes a step towards distributed and democratized evaluation for the next generation of robotic foundation models.
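To make the notion of real-to-sim correlation concrete, the following is a minimal sketch of how a Pearson correlation coefficient between paired simulated and real-world success rates could be computed. It is not the evaluation code used in this work: the function name `pearson_r` and the per-policy success rates are hypothetical and chosen only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical success rates for five policies, NOT results from the paper.
# Each index is one policy evaluated on the same tasks in both settings.
real_success = [0.20, 0.35, 0.50, 0.65, 0.80]
sim_success = [0.15, 0.40, 0.45, 0.70, 0.75]

r = pearson_r(sim_success, real_success)
print(f"Pearson r between simulated and real success rates: {r:.3f}")
```

A value of r close to 1 indicates that simulated success rates track real-world success rates linearly across policies, which is the property a simulated benchmark needs in order to serve as a signal for policy improvement.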
