PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies

Arhan Jain, Mingtong Zhang, Kanav Arora, William Chen, Marcel Torne, Muhammad Zubair Irshad, Sergey Zakharov, Yue Wang, Sergey Levine, Chelsea Finn, Wei-Chiu Ma, Dhruv Shah, Abhishek Gupta, Karl Pertsch

TL;DR

PolaRiS presents a scalable real-to-sim evaluation framework that converts short real-world video scans into high-fidelity simulated environments using neural scene reconstruction (2DGS) and articulated Gaussian splats. A lightweight co-training procedure aligns simulated visuals with real-world perception, enabling zero-shot evaluation of generalist Vision-Language-Action policies in unseen environments. Empirical results show strong real-to-sim correlation (Pearson r around 0.9) and favorable alignment with RoboArena, outperforming traditional simulators and video-model baselines. The approach enables rapid creation of diverse, realistic evaluation scenes and aims to democratize large-scale robotic policy benchmarking.
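As a rough illustration of the real-to-sim correlation metric mentioned above, the sketch below computes a Pearson r between per-policy success rates measured in simulation and in the real world. The success rates are made-up placeholder values, not results from the paper, and the function name `pearson_r` is a hypothetical helper introduced here for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical success rates for five policies (illustrative values only).
real_success = [0.20, 0.35, 0.50, 0.65, 0.80]
sim_success = [0.15, 0.40, 0.45, 0.70, 0.75]

r = pearson_r(sim_success, real_success)
print(f"Pearson r between sim and real success rates: {r:.3f}")
```

A value of r near 1 means that policies which do better in simulation also tend to do better in the real world, which is the property a simulated benchmark needs in order to be a useful proxy. Pearson r measures linear agreement only; a rank-based measure would be a separate choice.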

Abstract

A significant challenge for robot learning research is our ability to accurately measure and compare the performance of robot policies. Benchmarking in robotics is historically challenging due to the stochasticity, poor reproducibility, and time-consuming nature of real-world rollouts. This challenge is exacerbated for recent generalist policies, which have to be evaluated across a wide variety of scenes and tasks. Evaluation in simulation offers a scalable complement to real-world evaluations, but the visual and physical domain gap between existing simulation benchmarks and the real world has made them an unreliable signal for policy improvement. Furthermore, building realistic and diverse simulated environments has traditionally required significant human effort and expertise. To bridge the gap, we introduce Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS), a scalable real-to-sim framework for high-fidelity simulated robot evaluation. PolaRiS utilizes neural reconstruction methods to turn short video scans of real-world scenes into interactive simulation environments. Additionally, we develop a simple simulation data co-training recipe that bridges remaining real-to-sim gaps and enables zero-shot evaluation in unseen simulation environments. Through extensive paired evaluations between simulation and the real world, we demonstrate that PolaRiS evaluations correlate much more strongly with real-world generalist policy performance than existing simulated benchmarks do. Its simplicity also enables rapid creation of diverse simulated environments. As such, this work takes a step towards distributed and democratized evaluation for the next generation of robotic foundation models.

Paper Structure

This paper contains 22 sections, 5 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Overview of PolaRiS. PolaRiS is a real-to-sim approach that turns short videos of real-world scenes into high-fidelity simulated environments for scalable robot policy evaluation. We demonstrate that co-finetuning with simulation data for a small number of iterations enables generalist policies to be evaluated accurately in unseen PolaRiS environments out of the box. As part of this work, we provide: (1) tools for quickly creating new PolaRiS environments, (2) an open-source simulation dataset for finetuning generalist policies for PolaRiS evaluation, (3) off-the-shelf evaluation environments with strong real-to-sim correlation, and (4) a hub for sharing new PolaRiS environments with the community.
  • Figure 2: Environment creation in PolaRiS. A user first scans a real-world environment, robot, and objects. They then use PolaRiS to create articulated Gaussian splat representations that capture the geometry and visual appearance of the real-world environment. Finally, they compose the evaluation scene by combining the scanned-in scene, objects, and robot for use in policy evaluation.
  • Figure 3: Example of a scene being composed in our scene composition tool (https://polaris-evals.github.io/compose-environments/). Users can easily import environment and object assets into the tool, automatically load the DROID robot, and save the simulation-ready environment as a USD file. The procedure typically takes less than 5 minutes.
  • Figure 4: PolaRiS simulated co-training dataset environments. We collect a small number of demonstrations in simulation and co-finetune policies with this data for improved real-to-sim evaluation correlation. Importantly, once finetuned, policies evaluated in PolaRiS show strong real-to-sim correlation, even in unseen environments. New simulated evaluation environments are easily added without needing to collect additional demonstrations.
  • Figure 5: Evaluation environments. Top: real-world environments, Bottom: PolaRiS simulated evaluation replicas. We create high visual fidelity environments utilizing Gaussian Splats for environment reconstruction and TRELLIS for object asset generation.
  • ...and 8 more figures