Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, Pheng-Ann Heng

TL;DR

This study empirically examines whether state-of-the-art video models can serve as zero-shot visual reasoners through Chain-of-Frame reasoning. It introduces MME-CoF, a compact benchmark spanning 12 reasoning categories to standardize CoF-based evaluation of Veo-3 and other models. Across diverse tasks, the results show strong short-horizon coherence and grounding but notable deficiencies in long-horizon causal reasoning, geometric constraints, and abstract logic, suggesting that current video models are not yet reliable standalone zero-shot reasoners. The work highlights the potential of CoF-inspired reasoning as a complementary pathway for future collaborative visual reasoning with specialized models and prompts.

Abstract

Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet an important question remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to investigate this question, focusing on Veo-3, a leading and widely used model. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io

Paper Structure

This paper contains 85 sections, 20 figures, 3 tables.

Figures (20)

  • Figure 1: Overview of Our Study on the Reasoning Potential of Video Models. We investigate whether state-of-the-art video models exhibit emergent reasoning potentials beyond content synthesis. The analysis spans 12 reasoning dimensions under a unified perspective, exploring whether large-scale video models can serve as zero-shot visual reasoners via CoF reasoning.
  • Figure 2: Evaluation Radar Map.
  • Figure 3: Word Cloud.
  • Figure 5: Showcase of Visual Detail Reasoning by Veo-3. It illustrates Veo-3's ability to localize targets and maintain fine-grained visual attributes across frames, together with common failure modes when targets are small, occluded, or embedded in clutter.
  • Figure 6: Showcase of Visual Trace Reasoning by Veo-3 (Part I). It shows short-horizon path-following successes, object-grounding failures, and a certain bias that causes step omissions/mistakes in multi-step traces. $^\dagger$ The ground-truth answers of cases II and III are intuitive and non-unique, which are omitted to highlight the key reasoning behaviors.
  • ...and 15 more figures