RISE-Video: Can Video Generators Decode Implicit World Rules?

Mingxin Liu, Shuran Ma, Shibei Meng, Xiangyu Zhao, Zicheng Zhang, Shaofeng Zhang, Zhihang Zhong, Peixian Chen, Haoyu Cao, Xing Sun, Haodong Duan, Xue Yang

TL;DR

RISE-Video is a reasoning-centric benchmark that tests whether Text-Image-to-Video (TI2V) models internalize implicit world rules. It comprises 467 samples across eight reasoning dimensions, together with a four-metric evaluation framework and a scalable LMM-based judging pipeline, and it is used to evaluate 11 TI2V models. The results reveal substantial gaps in higher-level reasoning despite strong perceptual quality, underscoring the need for world-model-aware video generation. The benchmark offers a rigorous evaluation paradigm and practical tools to drive progress in reasoning-enabled TI2V systems.

Abstract

While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.

Paper Structure

This paper contains 16 sections, 25 figures, 3 tables.

Figures (25)

  • Figure 1: An example from the Experiential Knowledge dimension of RISE-Video, revealing limitations in experience-based reasoning of current TI2V models.
  • Figure 2: Task distribution of the RISE-Video benchmark, which comprises eight major task categories: Experiential Knowledge, Perceptual Knowledge, Temporal Knowledge, Spatial Knowledge, Commonsense Knowledge, Societal Knowledge, Subject Knowledge, and Logical Capability. Each category further contains comprehensive sub-categories and diverse data samples.
  • Figure 3: Evaluation pipeline of the RISE-Video benchmark. It covers four metrics: Reasoning Alignment, Temporal Consistency, Visual Quality, and Physical Rationality, with dimension-specific frame extraction strategies. Carefully designed prompts guide GPT-5 as the primary judge (GPT-5-mini for Visual Quality only), ensuring fair and objective evaluation.
  • Figure 4: Specialized evaluation pipeline for reasoning alignment in Schematic Puzzle tasks, which are not well suited to standard LMM-as-a-Judge evaluation. The pipeline combines trajectory-based constraint checking, grid-level structural alignment, and reference-assisted LMM comparison, enabling accurate and interpretable scoring of structured visual reasoning outcomes.
  • Figure 5: Representative generation results of leading models. We show the examples generated by Hailuo 2.3, Veo 3.1, and Kling 2.6.
  • ...and 20 more figures
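
The judge assignment described in the Figure 3 caption (GPT-5 as the primary judge, GPT-5-mini for Visual Quality only) can be sketched as a small routing table. This is a minimal illustration, not the authors' implementation: the names `judge_for_metric`, `JudgeRequest`, and `plan_requests` are hypothetical, no judge API is called, and the sketch ignores both the dimension-specific frame extraction and the specialized Schematic Puzzle pipeline from Figure 4.

```python
from dataclasses import dataclass

# The eight task categories named in the Figure 2 caption.
CATEGORIES = (
    "Experiential Knowledge", "Perceptual Knowledge",
    "Temporal Knowledge", "Spatial Knowledge",
    "Commonsense Knowledge", "Societal Knowledge",
    "Subject Knowledge", "Logical Capability",
)

# The four metrics named in the abstract and the Figure 3 caption.
METRICS = (
    "Reasoning Alignment", "Temporal Consistency",
    "Physical Rationality", "Visual Quality",
)

# Judge models as stated in the Figure 3 caption.
PRIMARY_JUDGE = "GPT-5"
VISUAL_QUALITY_JUDGE = "GPT-5-mini"


def judge_for_metric(metric: str) -> str:
    """Return the judge model name for a metric (hypothetical helper)."""
    if metric not in METRICS:
        raise ValueError(f"unknown metric: {metric!r}")
    return VISUAL_QUALITY_JUDGE if metric == "Visual Quality" else PRIMARY_JUDGE


@dataclass(frozen=True)
class JudgeRequest:
    """One (category, metric, judge) evaluation to run (hypothetical type)."""
    category: str
    metric: str
    judge: str


def plan_requests(category: str) -> list[JudgeRequest]:
    """List the four per-metric judge requests for one sample's category."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return [JudgeRequest(category, m, judge_for_metric(m)) for m in METRICS]


if __name__ == "__main__":
    for request in plan_requests("Spatial Knowledge"):
        print(f"{request.metric}: {request.judge}")
```

Keeping the routing in one function makes the single exception (Visual Quality) explicit and easy to change if a different judge assignment is assumed.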