Video Models Start to Solve Chess, Maze, Sudoku, Mental Rotation, and Raven's Matrices

Hokin Deng

TL;DR

This work investigates whether contemporary video-generation models can reason about structured visual problems. It introduces the VMEvalKit framework and a Task Pair evaluation paradigm to test models on chess, maze, Sudoku, mental rotation, and Raven's matrices, using both automated GPT-4o and human evaluations. Results show a clear performance hierarchy among models, with Sora-2 achieving the highest overall success and domain-specific strengths, while certain domains remain challenging. The study demonstrates scalable evaluation of visual reasoning and points to reinforcement learning and mechanistic interpretability as promising directions for improving reasoning in video models.

Abstract

We show that video generation models can now reason. Tested on tasks such as chess, maze, Sudoku, mental rotation, and Raven's Matrices, leading models such as Sora-2 achieve success rates of sixty percent. We establish a robust experimental paradigm centered on the "Task Pair" design. We build a code framework, with 39 models already available, that supports this paradigm and scales easily: users can add models and tasks efficiently. We show that our automated evaluation correlates strongly with human judgment, and therefore that this paradigm is highly scalable. Given the availability of this paradigm, we see an opportunity to use reinforcement learning to improve reasoning in video models. You can check out all of our raw results (https://grow-ai-like-a-child.com/video-reason/) and our VMEvalKit codebase (https://github.com/hokindeng/VMEvalKit).

Paper Structure

This paper contains 27 sections, 16 figures.

Figures (16)

  • Figure 1: Representative examples of video reasoning across diverse cognitive tasks. We show 6-frame temporal sequences from videos. The robust correlation (r = 0.949, p < 0.001) between human and automated GPT-4o evaluations demonstrates the reliability of our evaluation framework for assessing reasoning in video generation models.
  • Figure 2: Each of our question units includes three components: (1) an initial image showing the unsolved problem, (2) a text instruction describing the task, and (3) a final image showing the correct solution. During inference, models are given only the initial image and the instruction prompt. The final image is withheld and used solely for evaluation.
  • Figure 3: Success rates on all 5 tasks, showing a clear performance hierarchy across models.
  • Figure 4: Average success rates by reasoning domain reveal distinct levels of difficulty.
  • Figure 5: Individual model performance across all reasoning domains.
  • ...and 11 more figures
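
The Task Pair design described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the VMEvalKit API: the class name `TaskPair`, its field names, the `model_inputs` and `success_rate` helpers, the file paths, and the prompt text are all hypothetical, chosen only to show how the final image might be kept out of the model's inputs and how a success rate could be computed from per-video judgments.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TaskPair:
    """Hypothetical question unit: unsolved image, instruction, withheld solution."""
    initial_image: Path  # shown to the model at inference time
    prompt: str          # shown to the model at inference time
    final_image: Path    # withheld; used only when scoring the generated video

    def model_inputs(self) -> dict:
        """Return only the fields the model is allowed to see."""
        return {"image": self.initial_image, "prompt": self.prompt}


def success_rate(judgments: list[bool]) -> float:
    """Fraction of generated videos judged correct (0.0 for an empty list)."""
    return sum(judgments) / len(judgments) if judgments else 0.0


# Illustrative values only; the paths and prompt are made up for this sketch.
pair = TaskPair(
    initial_image=Path("maze_0001/first_frame.png"),
    prompt="Move the green dot through the maze to the red flag.",
    final_image=Path("maze_0001/final_frame.png"),
)
inputs = pair.model_inputs()
```

Keeping the solution image in the same record but outside `model_inputs()` makes the withholding explicit: an evaluator (human or automated) can compare the generated video against `final_image` without it ever reaching the model.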