Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu

TL;DR

This work introduces Thinking with Video, a unified multimodal reasoning paradigm that uses video generation to bridge visual and textual reasoning. It presents VideoThinkBench, a comprehensive benchmark with vision-centric and text-centric tasks, and evaluates Sora-2 against state-of-the-art VLMs, revealing competitive performance on spatial tasks and strong text-centric capabilities, often aided by video-embedded text. The study further analyzes factors like few-shot learning and self-consistency, and investigates the origins of text-centric reasoning, highlighting prompt rewriting as a major contributor. The findings suggest video-generation models may serve as versatile, unified reasoning engines, with future work expanding benchmarks, exploring additional models, and leveraging RL-based training and pretraining on text-to-video data to enhance multimodal understanding and generation.

Abstract

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning ability of large language models (LLMs) and vision-language models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) the separation of text and vision as distinct modalities hinders unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K and MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses them on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 75.53% accuracy on MMMU. Furthermore, we systematically analyze the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings demonstrate that video generation models have the potential to serve as unified multimodal understanding and generation models, positioning "Thinking with Video" as a unified multimodal reasoning paradigm.

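The abstract notes that self-consistency can improve Sora-2's performance. As a rough illustration of the general idea only (not the authors' implementation), the sketch below takes several answers sampled for the same question and keeps the most frequent one. The function name `majority_vote` and the normalization step are assumptions introduced here for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent non-empty answer, or None if there is none.

    Hypothetical helper: answers are lightly normalized (trimmed and
    lower-cased) before counting, which is an assumption of this sketch.
    """
    cleaned = [a.strip().lower() for a in answers if a and a.strip()]
    if not cleaned:
        return None
    # most_common(1) returns a list with one (answer, count) pair
    return Counter(cleaned).most_common(1)[0][0]


# Five sampled answers to one multiple-choice question (made-up values).
samples = ["C", "c ", "B", "", "C"]
print(majority_vote(samples))  # prints "c"
```

Empty or missing answers are ignored rather than counted as a vote, so a failed generation does not outvote valid ones.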

Paper Structure

This paper contains 49 sections, 1 equation, 14 figures, 10 tables.

Figures (14)

  • Figure 1: Vision-centric tasks and text-centric tasks in VideoThinkBench, and Sora-2's "Thinking with Video" solutions. Vision-centric tasks are solved by reasoning about visual elements via drawing and imagination, and include four categories: eyeballing puzzles, visual puzzles, ARC-AGI-2, and mazes. An example is shown for each. For instance, in the "ray reflection" problem from the eyeballing puzzles, Sora-2 accurately draws the light path and finds the specific point it passes through. Text-centric tasks are adapted from established benchmarks and are solved by text-based reasoning; a GSM8K example is shown. The model provides a written process and the correct answer within the video.
  • Figure 2: Four examples of Sora-2 solving our custom benchmark of 21 eyeballing tasks and 1050 samples. Each sample is a multiple-choice question and includes an input image with a text prompt. The benchmark is automatically evaluated and verifiable. See Section \ref{sec:spatial_reasoning} for details and prompts. In the bottom two examples, Sora-2 adds "Charlie" text on options that are not "C". This disparity between modalities is further explored in Section \ref{sec:output_form}. All prompts can be found in Section \ref{appendix_sec:eyeballing_prompts}.
  • Figure 3: Overview of the visual puzzles, categorized into color-filling tasks and shape-drawing tasks. The tasks are selected and adapted from PuzzleVQA (Chia et al., 2024) to evaluate inductive reasoning capability. The video generation model needs to fill the marked area with the correct color or draw the correct shape. Sora-2 correctly solved the problems above.
  • Figure 4: Examples of Sora-2 trying to solve ARC-AGI-2. ARC-AGI-2 is a benchmark targeting few-shot, inductive reasoning over abstract pattern transformations. Sora-2 is expected to deduce the transformation rule from examples and use the rule to generate the output grid of the test case. Besides automatic evaluation, we manually analyzed 100 cases and divided them into 4 categories based on completion level. Prompt: "Each row contains input and output grids. Learn the pattern and generate the output grid for the last input while keeping existing patterns without modification. Static camera perspective, no zoom or pan. In portrait." For the generated video and ground truth, only the test case area is displayed. Details: Section \ref{sec:arc_agi_2}
  • Figure 5: Input form and evaluation of text-centric tasks. The model accepts a text prompt and a reference image. The prompt contains the problem text and the reference image displays the entire problem. The model shows the textual solution process and the answer in the video, and speaks the answer in the audio. We evaluate the answers from the video and audio independently. The last frame is extracted for video evaluation and the audio is transcribed for audio evaluation. For evaluation, we adopt an LLM-as-a-Judge approach, detailed in Section \ref{sec:text-centric_eval}; the human alignment check is shown in \ref{app:human_alignment_text}.
  • ...and 9 more figures
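
The Figure 5 caption describes scoring the video answer (read from the last frame) and the audio answer (from the transcript) independently. The sketch below illustrates that bookkeeping only. It assumes the answers have already been extracted as strings, and it swaps the paper's LLM-as-a-Judge step for a plain normalized string match; the `Sample` fields, `normalize`, and the extra "both" metric are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    gold: str          # reference answer
    video_answer: str  # answer read from the last frame (assumed pre-extracted)
    audio_answer: str  # answer from the audio transcript (assumed pre-extracted)


def normalize(answer: str) -> str:
    """Trim, lower-case, and drop a trailing period (illustrative only)."""
    return answer.strip().lower().rstrip(".")


def is_correct(pred: str, gold: str) -> bool:
    # Stand-in for the LLM judge: exact match after normalization.
    return normalize(pred) == normalize(gold)


def score(samples):
    """Return accuracy for video, audio, and samples where both are correct."""
    n = len(samples)
    if n == 0:
        return {"video": 0.0, "audio": 0.0, "both": 0.0}
    video = sum(is_correct(s.video_answer, s.gold) for s in samples)
    audio = sum(is_correct(s.audio_answer, s.gold) for s in samples)
    both = sum(
        is_correct(s.video_answer, s.gold) and is_correct(s.audio_answer, s.gold)
        for s in samples
    )
    return {"video": video / n, "audio": audio / n, "both": both / n}


# Made-up samples, not data from the paper.
demo = [
    Sample(gold="18", video_answer="18", audio_answer="18."),
    Sample(gold="C", video_answer="c", audio_answer="B"),
    Sample(gold="7", video_answer="9", audio_answer="7"),
    Sample(gold="42", video_answer="41", audio_answer="40"),
]
print(score(demo))  # {'video': 0.5, 'audio': 0.5, 'both': 0.25}
```

Keeping the two channels separate makes disagreements between what the video shows and what the audio says visible, which a single merged score would hide.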