Adapting Vision-Language Models for Evaluating World Models

Mariya Hendriksen, Tabish Rashid, David Bignell, Raluca Georgescu, Abdelhak Lemkhenter, Katja Hofmann, Sam Devlin, Sarah Parisot

TL;DR

This work addresses the challenge of evaluating world-model rollouts with fine-grained temporal and semantic grounding. It introduces UNIVERSE, a unified evaluator built on a vision-language model and trained with mixed supervision and efficient frame sampling. UNIVERSE assesses whether rollouts are action-aligned and character-consistent through structured action recognition (AR) and character recognition (CR) tasks in binary, multiple-choice, and open-ended formats. The authors provide formal problem definitions, a PaliGemma-based architecture, and an adaptation protocol that reaches parity with task-specific checkpoints and aligns strongly with human judgments across diverse environments. They also show how frame sampling strategies, data mixes, and limited tuning enable scalable, semantics-aware evaluation without heavy supervision. The work could improve benchmarking and diagnosis of world models in simulation and embodied AI; the authors also discuss limitations and future work on real-world, long-horizon evaluation and on bias.

Abstract

World models - generative models that simulate environment dynamics conditioned on past observations and actions - are gaining prominence in planning, simulation, and embodied AI. However, evaluating their rollouts remains a fundamental challenge, requiring fine-grained, temporally grounded assessment of action alignment and semantic consistency - capabilities not captured by existing metrics. Vision-Language Models (VLMs) have shown promise as automatic evaluators of generative content due to their strong multimodal reasoning abilities. Yet, their use in fine-grained, temporally sensitive evaluation tasks remains limited and requires targeted adaptation. We introduce an evaluation protocol targeting two recognition tasks - action recognition and character recognition - each assessed across binary, multiple-choice, and open-ended formats. To support this, we present UNIVERSE (UNIfied Vision-language Evaluator for Rollouts in Simulated Environments), a VLM-based evaluator for video world model rollouts adapted under data and compute constraints. In our extensive experiments totaling over 5,154 GPU-days, we explore full, partial, and parameter-efficient adaptation methods across various task formats, context lengths, sampling methods, and data compositions. The resulting unified evaluator achieves parity with task-specific checkpoints. Human studies across seven diverse environments confirm strong alignment with human judgments, establishing UNIVERSE as a lightweight, adaptable, and semantics-aware evaluator for video world models.

Paper Structure

This paper contains 37 sections, 7 equations, 17 figures, 18 tables, 1 algorithm.

Figures (17)

  • Figure 1: Performance and efficiency of UNIVERSE (orange bars throughout) compared to task-specific baselines (multiple colours), all models trained for 10 epochs. Left and Center: Action Recognition and Character Recognition accuracy across binary, multiple-choice, and open-ended settings. Right: Sample efficiency: our adaptation recipe achieves strong performance with substantially fewer training samples per epoch.
  • Figure 2: Comparison of UNIVERSE and baseline models on Action and Character Recognition, all models trained for 1 epoch. Left: UNIVERSE outperforms all baselines on AR. Right: On CR, it ranks third, behind models with either full vision encoder tuning or task-specific training with greater supervision. Trained under a unified protocol with minimal parameter updates (0.07%) and reduced per-task data, UNIVERSE delivers strong performance across both tasks.
  • Figure 3: Action Recognition performance as a function of training supervision (epochs) and temporal context (number of frames), evaluated across all formats. Performance improves along both axes, with highest accuracy achieved when both dimensions are scaled.
  • Figure 4: Effect of frame sampling strategy on Action Recognition performance across all formats. Uniform-$n$ sampling (orange) consistently outperforms first-$n$ (blue), with especially large gains at low frame counts, and maintains an advantage as temporal context increases.
  • Figure 5: Exact Match accuracy for Action Recognition and Character Recognition.
  • ...and 12 more figures
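
The first-$n$ and uniform-$n$ sampling strategies compared in Figure 4 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice to include both the first and last frame in uniform sampling, and the handling of clips shorter than $n$ are all assumptions made for illustration.

```python
from typing import List


def first_n_indices(num_frames: int, n: int) -> List[int]:
    """Return the indices of the first n frames of a clip (hypothetical helper)."""
    n = max(0, min(n, num_frames))
    return list(range(n))


def uniform_n_indices(num_frames: int, n: int) -> List[int]:
    """Return n indices spread evenly over the whole clip (hypothetical helper).

    Assumption: both endpoints are included, so the first and last frames
    are always sampled when n >= 2. Other definitions of "uniform" exist.
    """
    n = max(0, min(n, num_frames))
    if n == 0:
        return []
    if n == 1:
        # Assumption: with a single frame, take the first one.
        return [0]
    last = num_frames - 1
    # Integer arithmetic keeps the result deterministic (no float rounding).
    return [(i * last) // (n - 1) for i in range(n)]


if __name__ == "__main__":
    # For a 10-frame rollout and a budget of 4 frames:
    print(first_n_indices(10, 4))    # covers only the start of the clip
    print(uniform_n_indices(10, 4))  # covers the full temporal extent
```

With the same frame budget, first-$n$ sees only the opening of a rollout, while uniform-$n$ spans its full duration, which is one plausible reading of why the caption reports the largest gains at low frame counts.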