iWorld-Bench: A Benchmark for Interactive World Models with a Unified Action Generation Framework

Jianjie Fang, Yingshan Lei, Qin Wan, Ziyou Wang, Yuchao Huang, Yongyan Xu, Baining Zhao, Weichen Zhang, Chen Gao, Xinlei Chen, Yong Li

Abstract

Achieving Artificial General Intelligence (AGI) requires agents that learn and interact adaptively, with interactive world models providing scalable environments for perception, reasoning, and action. Yet current research still lacks large-scale datasets and unified benchmarks to evaluate their physical interaction capabilities. To address this, we propose iWorld-Bench, a comprehensive benchmark for training and testing world models on interaction-related abilities such as distance perception and memory. We construct a diverse dataset with 330k video clips and select 2.1k high-quality samples covering varied perspectives, weather, and scenes. As existing world models differ in interaction modalities, we introduce an Action Generation Framework to unify evaluation and design six task types, generating 4.9k test samples. These tasks jointly assess model performance across visual generation, trajectory following, and memory. Evaluating 14 representative world models, we identify key limitations and provide insights for future research. The iWorld-Bench model leaderboard is publicly available at iWorld-Bench.com.
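
The Action Generation Framework mentioned above unifies models whose interaction interfaces differ (text commands, camera parameters, explicit trajectories) under one evaluation protocol. This excerpt does not define that interface, so the following is a minimal illustrative sketch assuming a canonical per-step action; the `Modality` enum, `UnifiedAction` dataclass, and `to_text_command` helper are hypothetical names, not the authors' API.

```python
from dataclasses import dataclass
from enum import Enum, auto

# Hypothetical sketch only: the paper's actual Action Generation
# Framework is not specified in this excerpt. All names are illustrative.

class Modality(Enum):
    TEXT = auto()           # natural-language motion command
    CAMERA_PARAMS = auto()  # camera extrinsics / pose parameters
    TRAJECTORY = auto()     # explicit waypoint sequence

@dataclass(frozen=True)
class UnifiedAction:
    """Canonical per-step action that each world-model adapter consumes."""
    modality: Modality
    translation: tuple[float, float, float] = (0.0, 0.0, 0.0)  # (x, y, z) in meters
    rotation: tuple[float, float, float] = (0.0, 0.0, 0.0)     # (roll, pitch, yaw) in radians
    text: str = ""  # used when modality is TEXT

def to_text_command(action: UnifiedAction) -> str:
    """Render the canonical action as a prompt for text-conditioned models."""
    dx, dy, dz = action.translation
    return action.text or (
        f"move ({dx:+.1f}, {dy:+.1f}, {dz:+.1f}) m, "
        f"rotate yaw {action.rotation[2]:+.2f} rad"
    )

# Example: one linear move command, rendered for a text-conditioned model.
step = UnifiedAction(Modality.TRAJECTORY, translation=(1.0, 0.0, 0.0))
print(to_text_command(step))  # move (+1.0, +0.0, +0.0) m, rotate yaw +0.00 rad
```

Under this reading, each evaluated model would need only a thin adapter from `UnifiedAction` to its native conditioning signal, which is what lets one task suite score heterogeneous models uniformly.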


Figures (13)

  • Figure 1: Overview of iWorld-Bench. iWorld-Bench encompasses four distinct perspectives: Unmanned Ground Vehicles (UGVs), Unmanned Aerial Vehicles (UAVs), humans, and robotics. It incorporates nine types of outdoor weather conditions, five different indoor lighting conditions, thousands of diverse scenes, and thousands of entities, providing a comprehensive and diverse evaluation environment. The benchmark leverages an Action Generation Framework to systematically and uniformly assess the interaction capabilities of interactive world models across various input modalities. It comprises six tasks, each involving a varying number of trajectories, designed to evaluate the adaptability and performance of models in dynamic and complex scenarios. Visualization of camera trajectory and view control in iWorld-Bench: $\boldsymbol{\rightarrow}$ denotes linear control commands for directional movement, $\boldsymbol{\dashrightarrow}$ represents actual trajectories generated by world models, and $\boldsymbol{\curvearrowright}$ indicates curved view rotation in the specified direction.
  • Figure 2: Data Processing Pipeline and Overview. a) Our data processing pipeline consists of four steps: 1) Data Collection: collecting 27.8M multi-image samples from 12 open-source datasets and 18 high-quality simulators. 2) Data Unification: standardizing the coordinate systems and formats of the data, followed by filtering, yielding 330K videos. 3) VLM-Assisted Annotation: using vision-language models (VLMs) to automatically annotate the 330K videos. 4) Human Verification: selecting 2,100 videos from the dataset and generating 4,900 high-quality tasks through human annotation. b) The distribution of the top 100 scenes in the dataset, reflecting scene diversity. c) The 9 types of outdoor environments (foggy, snowy night, partly cloudy, rainy night, snowy, night, rainy, cloudy, and sunny) and 5 types of indoor lighting conditions (fluorescent, natural, dim, warm, and artificial). d) A word cloud illustrating the complexity of entities in the dataset, covering a wide variety of objects and scene elements. e) Examples of world observations from four perspectives (drone, autonomous vehicle, pedestrian, and robot), visually showcasing the diversity and high quality of the dataset. (A minimal code sketch of this four-step flow follows the figure list.)
  • Figure 3: Radar charts of model performance across evaluation metrics. (a) Performance comparison of all 14 models on Action Control and Memory Ability tasks across 8 metrics. (b) Performance of 7 camera-parameter-controlled models on Camera Following tasks.
  • Figure 5: The systematic data curation pipeline: transforming raw multi-source inputs into high-fidelity training data via structured standardization and trajectory rectification.
  • Figure 6: A detailed showcase of the Difference Verification task across four difficulty levels (rows 1 to 4), based on camera operation complexity: Row 1 shows basic single-axis movements; Row 2 adds combined translation and rotation; Row 3 features sequential composite trajectories; and Row 4 involves complex multi-axis movements with view changes. These examples demonstrate the model's ability to detect subtle pose differences across various scenarios.
  • ...and 8 more figures
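
The four-stage curation flow in the Figure 2 caption (collect, unify and filter, VLM-annotate, human-verify) can be read as a simple sequential pipeline. The sketch below is a minimal outline under that reading; every function body is a placeholder stub, since the actual collection, filtering, annotation, and verification logic is not given in this excerpt.

```python
from typing import Iterable

# Placeholder stubs only: the real logic behind each stage is not
# described in this excerpt.

def collect(sources: Iterable[str]) -> list[dict]:
    """1) Data Collection: gather multi-image samples from datasets/simulators."""
    return [{"source": s} for s in sources]

def unify_and_filter(samples: list[dict]) -> list[dict]:
    """2) Data Unification: standardize coordinates/formats, filter to videos."""
    return [s | {"unified": True} for s in samples]

def vlm_annotate(video: dict) -> dict:
    """3) VLM-Assisted Annotation: attach automatic labels (stubbed)."""
    return video | {"annotation": "<VLM output>"}

def human_verify(videos: list[dict], n: int) -> list[dict]:
    """4) Human Verification: keep n manually verified samples (stub: first n)."""
    return videos[:n]

def curate(sources: Iterable[str], n_select: int = 2100) -> list[dict]:
    """Run the full pipeline: raw sources in, verified benchmark samples out."""
    annotated = [vlm_annotate(v) for v in unify_and_filter(collect(sources))]
    return human_verify(annotated, n=n_select)
```

The `n_select=2100` default mirrors the 2,100 human-verified videos reported in the caption; everything else is structural scaffolding rather than the authors' implementation.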