Adapting Vision-Language Models for Evaluating World Models
Mariya Hendriksen, Tabish Rashid, David Bignell, Raluca Georgescu, Abdelhak Lemkhenter, Katja Hofmann, Sam Devlin, Sarah Parisot
TL;DR
This work addresses the challenge of evaluating world-model rollouts with fine-grained temporal and semantic grounding. It introduces UNIVERSE, a unified evaluator based on a vision-language model and trained with mixed supervision and efficient frame sampling. UNIVERSE assesses whether rollouts are action-aligned and character-consistent through structured action recognition (AR) and character recognition (CR) tasks in binary, multiple-choice, and open-ended formats. The authors provide formal problem definitions, a PaliGemma-based architecture, and an adaptation protocol, and show that the unified evaluator reaches parity with task-specific checkpoints and aligns strongly with human judgments across diverse environments. They also show how frame sampling strategies, data mixes, and limited tuning enable scalable, semantics-aware evaluation without heavy supervision. The work could improve benchmarking and diagnosis of world models in simulation and embodied AI; it also discusses limitations and future work on real-world, long-horizon evaluation and bias.
Abstract
World models, generative models that simulate environment dynamics conditioned on past observations and actions, are gaining prominence in planning, simulation, and embodied AI. However, evaluating their rollouts remains a fundamental challenge: it requires fine-grained, temporally grounded assessment of action alignment and semantic consistency, capabilities not captured by existing metrics. Vision-Language Models (VLMs) have shown promise as automatic evaluators of generative content due to their strong multimodal reasoning abilities. Yet their use in fine-grained, temporally sensitive evaluation tasks remains limited and requires targeted adaptation. We introduce an evaluation protocol targeting two recognition tasks, action recognition and character recognition, each assessed across binary, multiple-choice, and open-ended formats. To support this, we present UNIVERSE (UNIfied Vision-language Evaluator for Rollouts in Simulated Environments), a VLM-based evaluator for video world model rollouts adapted under data and compute constraints. In our extensive experiments totaling over 5,154 GPU-days, we explore full, partial, and parameter-efficient adaptation methods across various task formats, context lengths, sampling methods, and data compositions. The resulting unified evaluator achieves parity with task-specific checkpoints. Human studies across seven diverse environments confirm strong alignment with human judgments, establishing UNIVERSE as a lightweight, adaptable, and semantics-aware evaluator for video world models.
