WorldEval: World Model as Real-World Robot Policies Evaluator

Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, Yi Xu

TL;DR

WorldEval introduces a world-model-based framework to safely and scalably evaluate real-world robotic manipulation policies by translating policy latent actions into action-conditioned videos via Policy2Vec. The approach provides an automated verifier and an online pipeline that correlates strongly with real-world performance, often outperforming real-to-sim benchmarks. Key findings include robust policy ranking, the utility of FID as a lightweight proxy for simple tasks, and the ability to detect unsafe policies, with demonstrated generalization to unseen objects and environments. This work offers a practical, scalable tool for rapid policy iteration in robotics research and development.

Abstract

The field of robotics has made significant strides toward developing generalist robot manipulation policies. However, evaluating these policies in real-world scenarios remains time-consuming and challenging, particularly as the number of tasks scales and environmental conditions change. In this work, we demonstrate that world models can serve as a scalable, reproducible, and reliable proxy for real-world robot policy evaluation. A key challenge is generating accurate policy videos from world models that faithfully reflect the robot actions. We observe that directly inputting robot actions or using high-dimensional encoding methods often fails to generate action-following videos. To address this, we propose Policy2Vec, a simple yet effective approach that turns a video generation model into a world simulator that follows latent actions to generate robot videos. We then introduce WorldEval, an automated pipeline designed to evaluate real-world robot policies entirely online. WorldEval effectively ranks various robot policies and individual checkpoints within a single policy, and functions as a safety detector to prevent dangerous actions by newly developed robot models. Through comprehensive paired evaluations of manipulation policies in real-world environments, we demonstrate a strong correlation between policy performance in WorldEval and real-world scenarios. Furthermore, our method significantly outperforms popular methods such as the real-to-sim approach.

Paper Structure

This paper contains 15 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: WorldEval is an adaptive, flexible, and reliable approach for evaluating real-world policies across diverse tasks. It demonstrates a strong correlation between its success rates in a world simulator and those observed with real robots.
  • Figure 2: Model architecture and evaluation pipeline for WorldEval. Top: We extract Policy2Vec embeddings from a pool of robot policies and inject them into a pre-trained video generation model, transforming it into a world model. Bottom: The overall WorldEval pipeline for evaluating policy models using the world simulator.
  • Figure 3: Robot and task setup. The top-left panel illustrates the robot setup. We employ a bimanual ALOHA-style robot and use only its top-view camera, a RealSense 457; the other cameras are not used in our work.
  • Figure 4: Real vs. WorldEval success rates. WorldEval evaluation setup shows a strong correlation to real policy performance. Good policy evaluation proxies have low MMRV and high Pearson correlation (r).
  • Figure 5: Visualization of Real-World Robot Policy and Generated Video Policy. Left: Real-world robot policy. Right: Generated video policy. Top three rows: Tasks successfully completed by the robot. Bottom two rows: Tasks where the model failed. All policies utilized $\pi_{0}$. The video demo is presented in the supplementary material.
  • ...and 5 more figures
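
The Figure 4 caption judges an evaluation proxy by two numbers: MMRV (lower is better) and the Pearson correlation r (higher is better). The page does not define either, so the sketch below is only an illustration. It assumes MMRV means Mean Maximum Rank Violation as commonly defined in real-to-sim evaluation work: for each policy, take the largest real-world success-rate gap to any other policy whose relative order the proxy gets wrong, then average over policies. The function names `pearson_r` and `mmrv` and all success rates are hypothetical and are not taken from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mmrv(real, proxy):
    """Mean Maximum Rank Violation (assumed definition, see lead-in).

    real[i]  : real-world success rate of policy i
    proxy[i] : success rate of policy i under the evaluation proxy
    """
    n = len(real)
    total = 0.0
    for i in range(n):
        worst = 0.0
        for j in range(n):
            # The pair is violated when the proxy orders i and j
            # differently from the real-world results.
            flipped = (proxy[i] < proxy[j]) != (real[i] < real[j])
            if flipped:
                worst = max(worst, abs(real[i] - real[j]))
        total += worst
    return total / n


# Hypothetical success rates for three policies (not from the paper).
real_rates = [0.2, 0.5, 0.9]
proxy_rates = [0.4, 0.3, 0.8]  # the proxy swaps the first two policies

print("Pearson r:", pearson_r(real_rates, proxy_rates))
print("MMRV:", mmrv(real_rates, proxy_rates))
```

In this toy case the proxy misorders two policies whose real success rates differ by 0.3, so two of the three policies each contribute a maximum violation of 0.3 and the MMRV is about 0.2. A proxy that preserves the real-world ranking gives an MMRV of 0.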