Rethinking Chain-of-Thought Reasoning for Videos

Yiwu Zhong, Zi-Yuan Hu, Yin Li, Liwei Wang

TL;DR

The paper challenges the necessity of long, human-like chain-of-thought reasoning for video understanding. It demonstrates that concise reasoning combined with a reduced set of visual tokens, learned through RL post-training with GRPO and token compression, can yield competitive accuracy with substantially lower inference and training costs. By eliminating CoT annotations and heavy SFT, the approach achieves strong performance across diverse benchmarks while significantly improving efficiency. The findings imply a shift toward efficient, concise reasoning paradigms for multimodal video reasoning and potentially broader visual reasoning tasks.
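The TL;DR names GRPO as the RL post-training algorithm. As a rough illustration of the group-relative advantage step usually associated with GRPO, and not the authors' implementation, the sketch below normalizes the rewards of several responses sampled for the same prompt. The function name, the `eps` term, the use of the population standard deviation, and the example reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled response's reward against its own group.

    Hypothetical helper: rewards is a list of scalar rewards, one per
    response sampled for the same (video, question) prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards only (e.g. 1.0 = correct answer, 0.0 = wrong answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean receive a positive advantage and the rest a negative one, so no separate value model is needed to form the baseline.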

Abstract

Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing, and recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. However, these models typically build on lengthy reasoning chains and large numbers of input visual tokens. Motivated by empirical observations from our benchmark study, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning. To evaluate this hypothesis, we design and validate an efficient post-training and inference framework that enhances a video MLLM's reasoning capability. Our framework enables models to operate on compressed visual tokens and generate brief reasoning traces prior to answering. The resulting models achieve substantially improved inference efficiency, deliver competitive performance across diverse benchmarks, and avoid reliance on manual CoT annotations or supervised fine-tuning. Collectively, our results suggest that long, human-like CoT reasoning may not be necessary for general video reasoning, and that concise reasoning can be both effective and efficient. Our code will be released at https://github.com/LaVi-Lab/Rethink_CoT_Video.

Paper Structure

This paper contains 16 sections, 4 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: CoT Reasoning, with dense prefilling and lengthy decoding, incurs a substantial computational load at both training and inference. In contrast, Concise Reasoning coupled with token compression is significantly more efficient, thanks to sparse prefilling and concise decoding.
  • Figure 2: Overview of CoT models. After pre-training, they are typically post-trained in an SFT stage that uses CoT annotations, followed by an RL stage. In both training and inference, the models suffer from heavy prefilling over dense visual tokens and from lengthy decoding caused by human-like thinking generation.
  • Figure 3: Statistics of training and inference overhead. (a) Training overhead shows the training runtime of a CoT model (i.e., Video-R1), measured on four A800-SXM4-80GB GPUs. (b) Inference overhead reports the inference statistics (i.e., decoding length and inference runtime) of different reasoning modes, measured on a single A800-SXM4-80GB GPU.
  • Figure 4: Framework of our method. (1) Typical CoT models are trained in three stages and perform long reasoning during inference. (2) In comparison, our method requires neither the supervised fine-tuning stage nor its annotations, and generates concise reasoning during inference. (3) We further reduce computation overhead with trainable token compression.
  • Figure 5: Visualization of text generated by CoT reasoning (Video-R1) and concise reasoning (Ours). Selected text chunks are colored based on human validation. Green: ground-truth or correctly predicted answers. Blue: correct intermediate reasoning steps. Purple: unnecessary intermediate reasoning steps. Red: incorrect intermediate reasoning steps or final predictions.