MIND: Benchmarking Memory Consistency and Action Control in World Models

Yixuan Ye, Xuanyu Lu, Yuxin Jiang, Yuchao Gu, Rui Zhao, Qiwei Liang, Jiachun Pan, Fengda Zhang, Weijia Wu, Alex Jinpeng Wang

TL;DR

MIND addresses the lack of a comprehensive benchmark for memory consistency and action control in open-domain world models across first-person and third-person viewpoints. It introduces a dataset of 250 high-quality videos rendered in Unreal Engine 5, along with an evaluation framework that quantifies memory stability, long-context recall, and action generalization under varied action spaces, complemented by the Video-to-World baseline MIND-World. The work provides concrete metrics such as $L_{\text{mem}}$, $L_{\text{lcm}}$, and $L_{\text{gsc}}$, and its experiments show that current models improve with memory context but still struggle with cross-action-space generalization and long-horizon coherence. The dataset and benchmark offer a standardized, reproducible platform for developing temporally consistent, memory-aware open-domain world models, with practical implications for interactive AI systems and simulation-based research.

Abstract

World models aim to understand, remember, and predict dynamic visual environments, yet a unified benchmark for evaluating their fundamental abilities remains lacking. To address this gap, we introduce MIND, the first open-domain closed-loop revisited benchmark for evaluating Memory consIstency and action coNtrol in worlD models. MIND contains 250 high-quality videos at 1080p and 24 FPS, including 100 (first-person) + 100 (third-person) video clips under a shared action space and 25 + 25 clips across varied action spaces covering eight diverse scenes. We design an efficient evaluation framework to measure two core abilities: memory consistency and action control, capturing temporal stability and contextual coherence across viewpoints. Furthermore, we design various action spaces, including different character movement speeds and camera rotation angles, to evaluate the action generalization capability across different action spaces under shared scenes. To facilitate future performance benchmarking on MIND, we introduce MIND-World, a novel interactive Video-to-World baseline. Extensive experiments demonstrate the completeness of MIND and reveal key challenges in current world models, including the difficulty of maintaining long-term memory consistency and generalizing across action spaces. Project page: https://csu-jpg.github.io/MIND.github.io/
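The dataset composition stated in the abstract can be tallied in a few lines. This is only an illustrative sketch: the split labels and helper names are hypothetical, and reading "25 + 25" as first-person plus third-person is an assumption made by analogy with the "100 (first-person) + 100 (third-person)" split.

```python
# Clip counts as stated in the abstract. The key names are illustrative labels,
# not identifiers from the MIND release. Assumption: the "25 + 25" varied-action
# clips split by viewpoint the same way the 100 + 100 shared-action clips do.
SPLITS = {
    ("first_person", "shared_action_space"): 100,
    ("third_person", "shared_action_space"): 100,
    ("first_person", "varied_action_space"): 25,
    ("third_person", "varied_action_space"): 25,
}

FPS = 24  # frame rate stated in the abstract


def total_clips(splits):
    """Sum the clip counts over every (viewpoint, action-space) split."""
    return sum(splits.values())


def clips_per_view(splits, view):
    """Sum the clip counts for one viewpoint across both action-space settings."""
    return sum(count for (v, _), count in splits.items() if v == view)


def frames_for_seconds(seconds, fps=FPS):
    """Number of frames covered by a clip of the given duration."""
    return round(seconds * fps)
```

Under these assumptions the four splits add up to the 250 videos reported in the abstract, and each viewpoint contributes 125 of them.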

Paper Structure (17 sections, 4 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Evaluation for Memory Consistency and Action Control with MIND. The first open-domain closed-loop revisited benchmark at 1080p/24 FPS for evaluating world models from both first-person and third-person perspectives.
  • Figure 2: Overview of MIND. We build and collect the first open-domain closed-loop revisited benchmark using Unreal Engine 5, supporting both first-person and third-person perspectives at 1080p resolution and 24 FPS.
  • Figure 3: Distribution of Scene Categories and Action Space in MIND. MIND supports open-domain scenarios with diverse and well-balanced action spaces.
  • Figure 4: Action Generalization from MIND. Different generalization settings for $\Delta_p$ (movement increment) and $\Delta_r$ (camera angle increment) are derived from both first-person and third-person perspectives. Each image is captured after the action has been executed for 24 frames.
  • Figure 5: The 10 Symmetric Motion Paths. The blue line represents the original path, and the red line represents the corresponding mirrored path. Each action lasts 24 frames.
  • ...and 3 more figures
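
The mirrored paths described in the Figure 5 caption can be illustrated with a small sketch. The action vocabulary and the left/right swap rule below are assumptions made for illustration; the caption only states that each path has a mirrored counterpart and that each action lasts 24 frames.

```python
FRAMES_PER_ACTION = 24  # from the Figure 5 caption

# Hypothetical action vocabulary: the real MIND action names are not given here.
# Assumption: mirroring swaps left/right actions and leaves the others unchanged.
MIRROR = {
    "forward": "forward",
    "backward": "backward",
    "left": "right",
    "right": "left",
    "turn_left": "turn_right",
    "turn_right": "turn_left",
}


def mirror_path(path):
    """Return the left/right mirrored counterpart of a sequence of actions."""
    return [MIRROR[action] for action in path]


def path_length_frames(path, frames_per_action=FRAMES_PER_ACTION):
    """Total frame count of a path when every action lasts the same number of frames."""
    return len(path) * frames_per_action


original = ["forward", "left", "turn_right"]
mirrored = mirror_path(original)
```

Under this swap rule, mirroring a path twice returns the original, and a path of three actions spans 72 frames.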