Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition

Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, Wynne Hsu

TL;DR

This work tackles two challenges in video understanding: fine-grained spatiotemporal perception and cognitive-level video reasoning. It introduces MotionEpic, a video MLLM that grounds content through structured spatial-temporal scene graph (STSG) representations, and the Video-of-Thought (VoT) framework, which decomposes complex video tasks into a sequence of manageable steps from pixel grounding to semantic inference. Across eight challenging video QA benchmarks, the combination of MotionEpic and VoT yields state-of-the-art results, with detailed analyses demonstrating improved grounding reliability, a breakdown of the reasoning process, and strong zero-shot performance. The approach offers a scalable path toward human-like video understanding and reasoning, with potential applications across diverse video understanding tasks, and a framework that future work can extend with broader commonsense and causal knowledge.

Abstract

Existing research on video understanding still struggles to achieve in-depth comprehension and reasoning in complex videos, primarily due to the under-exploration of two key bottlenecks: fine-grained spatial-temporal perceptive understanding and cognitive-level video scene comprehension. This paper bridges the gap by presenting a novel solution. We first introduce a novel video Multimodal Large Language Model (MLLM), MotionEpic, which achieves fine-grained pixel-level spatial-temporal video grounding by integrating a video spatial-temporal scene graph (STSG) representation. Building upon MotionEpic, we then develop a Video-of-Thought (VoT) reasoning framework. VoT inherits the Chain-of-Thought (CoT) core, breaking down a complex task into simpler and manageable sub-problems and addressing them step by step, from low-level pixel perception to high-level cognitive interpretation. Extensive experiments across various complex video QA benchmarks demonstrate that our overall framework substantially improves on the existing state of the art. To our knowledge, this is the first attempt at successfully implementing the CoT technique for achieving human-level video reasoning, and we show great potential in extending it to a wider range of video understanding scenarios. The project is available at https://haofei.vip/VoT
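
The step-by-step decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `StepChain` class, the four step functions, and the context keys are all hypothetical names chosen for illustration, and each step is a string-building placeholder standing in for a call to a video MLLM. The sketch only shows the control flow in which every step reads the results of the earlier steps and adds its own, moving from perception-like steps to interpretation-like steps.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type: a step reads the running context and returns new entries.
Step = Callable[[Dict[str, str]], Dict[str, str]]


@dataclass
class StepChain:
    """Runs steps in order; each step sees everything produced before it."""
    steps: List[Step] = field(default_factory=list)

    def run(self, question: str) -> Dict[str, str]:
        context: Dict[str, str] = {"question": question}
        for step in self.steps:
            # Merge the step's output so later steps can build on it.
            context.update(step(context))
        return context


# Placeholder steps (assumptions, not the paper's prompts). A real system
# would query a video MLLM here instead of formatting strings.
def ground_target(ctx: Dict[str, str]) -> Dict[str, str]:
    return {"target": f"target for '{ctx['question']}'"}


def track_target(ctx: Dict[str, str]) -> Dict[str, str]:
    return {"track": f"track of [{ctx['target']}]"}


def interpret_action(ctx: Dict[str, str]) -> Dict[str, str]:
    return {"interpretation": f"action inferred from [{ctx['track']}]"}


def answer_question(ctx: Dict[str, str]) -> Dict[str, str]:
    return {"answer": f"answer based on [{ctx['interpretation']}]"}


chain = StepChain([ground_target, track_target, interpret_action, answer_question])
result = chain.run("What is the person doing?")
print(result["answer"])
```

Keeping the intermediate results in one shared context means each stage can be inspected on its own, which is the kind of per-step analysis a chain-of-thought style pipeline makes possible.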

Paper Structure

This paper contains 36 sections, 10 figures, and 4 tables.

Figures (10)

  • Figure 1: Human-like video reasoning intuitively follows a multi-step procedure, from lower-level perceptive fine-grained pixel grounding and tracking, to higher-level cognitive action scene semantics understanding.
  • Figure 2: Overview of the MotionEpic video MLLM.
  • Figure 3: The STSG expression generated by MotionEpic, with its corresponding structural STSG illustration.
  • Figure 4: An illustrative view of the VoT framework. The complete I/O and prompts are detailed in the Appendix.
  • Figure 5: MotionEpic performance on object grounding, scene graph triplet classification, and action grounding.
  • ...and 5 more figures