VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?

Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y. Charles, Xinyu Zhou, Xu Sun

TL;DR

VideoReasonBench targets vision-centric, complex video reasoning by formalizing latent-state video tasks across three escalating levels and six demonstration types. The authors couple a video engine and question engine to generate 1,440 questions over 240 videos and evaluate 18 MLLMs plus humans, revealing large gaps between current models and human performance and highlighting the crucial role of extended chain-of-thought reasoning for this domain. Unlike existing benchmarks, the dataset shows substantial reliance on visual information and a steep performance drop when visual input is reduced, emphasizing the need for robust visual perception and reasoning capabilities in multimodal models. The work provides a challenging, scalable testbed and prompts future research directions to develop models capable of deeper visual reasoning and CoT-enabled strategies, with integration planned into VLMEvalKit.

Abstract

Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit has yet to be demonstrated in video understanding, since most existing benchmarks lack the reasoning depth needed to show the advantages of extended CoT chains. While recent efforts have proposed benchmarks aimed at video reasoning, their tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is visible in only part of the video. The questions evaluate three escalating levels of video reasoning skills: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under this task setting, models must precisely recall multiple operations in the video and reason step by step to reach the correct final answers. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs) and find that most perform poorly on complex video reasoning: GPT-4o achieves only 6.9% accuracy, while the thinking-enhanced Gemini-2.5-Pro significantly outperforms the others with 56.0% accuracy. Our investigations of "test-time scaling" further reveal that an extended thinking budget, while offering little or no benefit on existing video benchmarks, is essential for improving performance on VideoReasonBench.
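The latent-state setup described in the abstract can be made concrete with a small sketch. The example below is only an illustration under stated assumptions: the slot-swapping demonstration, the `Demo` class, and all function names are hypothetical and are not taken from the benchmark's code. It shows how one hidden state plus a sequence of observable operations supports questions at the three levels (recall, infer, predict).

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical operation: exchange the contents of two slots.
Swap = Tuple[int, int]


@dataclass
class Demo:
    """A toy stand-in for one video (assumed structure, not the benchmark's)."""
    initial_state: List[str]  # latent state, shown only at the start
    operations: List[Swap]    # observable operations, applied in order


def apply_ops(state: List[str], ops: List[Swap]) -> List[str]:
    """Return a new state after applying each swap in sequence."""
    state = list(state)
    for i, j in ops:
        state[i], state[j] = state[j], state[i]
    return state


def recall_operation_count(demo: Demo) -> int:
    """Level 1 (recall): how many operations were observed?"""
    return len(demo.operations)


def infer_final_state(demo: Demo) -> List[str]:
    """Level 2 (infer): what is the hidden state at the end of the video?"""
    return apply_ops(demo.initial_state, demo.operations)


def predict_after(demo: Demo, future_ops: List[Swap]) -> List[str]:
    """Level 3 (predict): what would the state be after further operations?"""
    return apply_ops(infer_final_state(demo), future_ops)


demo = Demo(initial_state=["coin", "", ""], operations=[(0, 1), (1, 2), (0, 2)])
print(recall_operation_count(demo))      # 3
print(infer_final_state(demo))           # ['coin', '', '']
print(predict_after(demo, [(0, 1)]))     # ['', 'coin', '']
```

Each higher level depends on the one below it: a single missed or misread swap changes the inferred state, which is why the task rewards precise perception followed by step-by-step reasoning.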

Paper Structure

This paper contains 30 sections, 6 figures, 13 tables.

Figures (6)

  • Figure 1: Examples from VideoReasonBench and three existing VideoQA benchmarks. Responses are generated by Gemini-2.5-Flash in both "Thinking" and "No Thinking" modes. Text highlighted in green/red indicates correct/incorrect responses. While questions from existing benchmarks can be answered correctly without "Thinking" using only a few tokens, VideoReasonBench requires "Thinking" for accurate reasoning and consumes substantially more tokens (see Figure 5 for quantitative results). It also demands finer-grained visual perception during reasoning.
  • Figure 2: Illustration of vision-centric complex video reasoning. Upper: In each video, the latent state is revealed either at the beginning or the end, and a sequence of observable operations is applied to this state. There are six categories of videos, each featuring a different type of demonstration. Lower: The questions assess video reasoning across three levels, with two skills for each level.
  • Figure 3: Overview of our data construction framework. The video engine generates state transitions from a given configuration, producing videos via Matplotlib, command-line screenshots, or real-world manual recordings. The question engine then generates questions and derives answers based on the state transitions, following the rules of each demonstration.
  • Figure 4: VideoReasonBench video and question distributions.
  • Figure 5: Performance of Gemini-2.5-Flash with varying thinking budgets on five benchmarks. "Generated Tokens" is the sum of "Thinking Tokens" and "Response Tokens".
  • ...and 1 more figures