WorldEval: World Model as Real-World Robot Policies Evaluator
Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, Yi Xu
TL;DR
WorldEval introduces a world-model-based framework to safely and scalably evaluate real-world robotic manipulation policies by translating policy latent actions into action-conditioned videos via Policy2Vec. The approach provides an automated verifier and an online pipeline that correlates strongly with real-world performance, often outperforming real-to-sim benchmarks. Key findings include robust policy ranking, the utility of FID as a lightweight proxy for simple tasks, and the ability to detect unsafe policies, with demonstrated generalization to unseen objects and environments. This work offers a practical, scalable tool for rapid policy iteration in robotics research and development.
Abstract
The field of robotics has made significant strides toward developing generalist robot manipulation policies. However, evaluating these policies in real-world scenarios remains time-consuming and challenging, particularly as the number of tasks scales and environmental conditions change. In this work, we demonstrate that world models can serve as a scalable, reproducible, and reliable proxy for real-world robot policy evaluation. A key challenge is generating policy videos from world models that faithfully reflect the robot's actions. We observe that directly inputting robot actions or using high-dimensional encoding methods often fails to produce videos that follow the actions. To address this, we propose Policy2Vec, a simple yet effective approach that turns a video generation model into a world simulator which generates robot videos conditioned on the policy's latent actions. We then introduce WorldEval, an automated pipeline designed to evaluate real-world robot policies entirely online. WorldEval effectively ranks different robot policies as well as individual checkpoints of a single policy, and it functions as a safety detector that flags dangerous actions by newly developed robot models. Through comprehensive paired evaluations of manipulation policies in real-world environments, we demonstrate a strong correlation between policy performance in WorldEval and in real-world scenarios. Furthermore, our method significantly outperforms popular alternatives such as real-to-sim approaches.
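
The abstract's central claim is a strong correlation between policy performance in WorldEval and in the real world. The abstract does not say which correlation statistic is used, so the following is only a minimal sketch of how such a paired comparison could be checked, assuming each policy has one success rate per setting. The function names (`pearson`, `ranks`, `spearman`) and the success rates are hypothetical placeholders for illustration, not the authors' code or results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Ranks starting at 1 for the smallest value; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Placeholder success rates for four hypothetical policies (NOT from the paper).
real_world = [0.20, 0.50, 0.70, 0.90]
world_model = [0.30, 0.40, 0.80, 0.85]

print("Pearson:", round(pearson(real_world, world_model), 3))
print("Spearman:", round(spearman(real_world, world_model), 3))
```

Pearson measures how linearly the two sets of scores track each other, while Spearman only asks whether the world model orders the policies the same way as real-world trials, which is the property that matters most for policy ranking.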
