VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning

Zikang Wang, Boyu Chen, Zhengrong Yue, Yi Wang, Yu Qiao, Limin Wang, Yali Wang

TL;DR

VideoChat-A1 tackles long-form video understanding by introducing a chain-of-shot reasoning paradigm that explicitly treats shots as the core units. Through iterative Shot Selection, Shot Partition, and Shot Reflection, guided by a video glance and LongCLIP-based retrieval, the approach performs coarse-to-fine analysis of relevant shots with multi-round MLLM reasoning. It achieves state-of-the-art or competitive results on EgoSchema, LongVideoBench, MLVU, and VideoMME, while using far fewer input frames and lower inference time than large closed-source models. The work demonstrates that shot-aware, interactive reasoning yields higher fidelity and efficiency for long-video QA, offering practical benefits for scalable multimodal understanding.

Abstract

Recent advances in video understanding have been driven by multimodal large language models (MLLMs). However, these MLLMs are good at analyzing short videos but struggle to understand videos with a longer context. To address this difficulty, several agent paradigms have recently been proposed, using MLLMs as agents that retrieve extra contextual knowledge from a long video. However, most existing agents ignore the key fact that a long video is composed of multiple shots; that is, to answer a user question about a long video, it is critical to understand its relevant shots deeply, as a human would. Without this insight, these agents often mistakenly retrieve redundant or even noisy temporal context, which restricts their capacity for long video understanding. To fill this gap, we propose VideoChat-A1, a novel long video agent paradigm. Unlike previous works, VideoChat-A1 can think deeply with long videos via a distinct chain-of-shot reasoning paradigm. More specifically, it progressively selects the shots relevant to the user question and examines them through a coarse-to-fine partition. By multimodal reasoning along the shot chain, VideoChat-A1 effectively mimics the step-by-step human thinking process, allowing it to interactively discover preferable temporal context for thoughtful understanding of long videos. Extensive experiments show that VideoChat-A1 achieves state-of-the-art performance on mainstream long video QA benchmarks; e.g., it achieves 77.0 on VideoMME and 70.1 on EgoSchema, outperforming its strong baselines (e.g., Intern2.5VL-8B and InternVideo2.5-8B) by up to 10.8% and 6.2%. Compared to the leading closed-source GPT-4o and Gemini 1.5 Pro, VideoChat-A1 offers competitive accuracy, but with 7% of the input frames and 12% of the inference time on average.

Paper Structure

This paper contains 25 sections, 7 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Motivation. Direct reasoning models such as GPT-4o and Qwen2.5VL-72B perform global sampling on videos and struggle to focus on key information in long videos. Agent-based methods like VideoTree often suffer from incorrect or redundant sampling, producing noisy captions that lead to wrong answers, because they lack deep thinking about the shots in a long video. In contrast, VideoChat-A1 interactively employs shot perception and reasoning via Chain-of-Shot, which progressively looks into relevant shots through a reflective process to achieve superior performance.
  • Figure 2: Framework. VideoChat-A1 introduces a novel Chain-of-Shot Reasoning framework for long video understanding. It progressively refines video analysis through iterative stages of Shot Selection, Shot Partition, and Shot Reflection, leveraging MLLMs to dynamically discover relevant video shots and generate a reliable answer.
  • Figure 3: Shot Partition. Given a candidate shot at step $i$, VideoChat-A1 first applies K-Means clustering to obtain $K$ cluster centers, which are used to find key frames. Subshots are then partitioned based on the feature distance between each frame and its two adjacent key frames.
  • Figure 4: Shot Reasoning and Shot Reflection. At each step, VideoChat-A1 performs question-answering reasoning using the relevant shots identified. It then evaluates the generated answer and historical information to reflect on the confidence level of the response. Based on the confidence score and the number of reasoning iterations, the system either proceeds to the next step for further refinement or terminates the reasoning process to output the final answer. This iterative reflection ensures reliable and contextually accurate responses.
  • Figure 5: Visual comparison of different models.
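The Shot Partition step described in the Figure 3 caption can be illustrated with a minimal pure-Python sketch. It is one possible reading of the caption, not the authors' implementation. The function names (`kmeans`, `key_frames`, `partition_shot`), the evenly spaced initialization, and the rule "cut at the first frame that is closer to the right key frame than to the left one" are all assumptions made for illustration. Real frame features would come from a visual encoder; here they are toy 1-D vectors.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns k centroids.

    Assumption: centers are initialized from frames evenly spaced in time,
    which keeps the sketch deterministic.
    """
    n = len(points)
    centers = [list(points[(2 * j + 1) * n // (2 * k)]) for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the previous center if a cluster is empty
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers


def key_frames(features, k):
    """Indices of the frames nearest to each centroid, sorted in time."""
    centers = kmeans(features, k)
    nearest = {
        min(range(len(features)), key=lambda t: sq_dist(features[t], c))
        for c in centers
    }
    return sorted(nearest)


def partition_shot(features, k):
    """Split frames [0, T) into contiguous subshots as (start, end) pairs.

    Assumed rule: between two adjacent key frames, the boundary is the first
    frame whose feature is closer to the right key frame than to the left one.
    """
    keys = key_frames(features, k)
    boundaries = [0]
    for left, right in zip(keys, keys[1:]):
        cut = right
        for t in range(left + 1, right):
            to_right = sq_dist(features[t], features[right])
            to_left = sq_dist(features[t], features[left])
            if to_right < to_left:
                cut = t
                break
        boundaries.append(cut)
    boundaries.append(len(features))
    return list(zip(boundaries, boundaries[1:]))


# Toy shot: five frames around one appearance, then five around another.
features = [[0.0], [2.0], [4.0], [3.0], [9.0],
            [20.0], [22.0], [24.0], [23.0], [21.0]]
subshots = partition_shot(features, k=2)
print(subshots)  # [(0, 5), (5, 10)]
```

In this toy input the key frames are frames 2 and 6, and frame 4 (value 9.0) stays with the left subshot because it is still closer to the left key frame.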
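The control flow described in the Figure 4 caption (reason, reflect on confidence, then refine or stop) can be sketched as a small loop. This is a hedged illustration only: `chain_of_shot`, its callable arguments (`select`, `partition`, `reason`, `reflect`), the `max_steps` cap, and the `conf_threshold` value are hypothetical names and defaults, and the toy stand-ins below replace what would be retrieval and MLLM calls in the real system.

```python
def chain_of_shot(question, shots, select, partition, reason, reflect,
                  max_steps=5, conf_threshold=0.8):
    """Iterate select -> reason -> reflect -> partition until confident.

    All four callables are hypothetical placeholders supplied by the caller.
    Returns the final answer and the (answer, confidence) history.
    """
    history = []
    answer = None
    for _ in range(max_steps):
        relevant = select(question, shots)             # Shot Selection
        answer = reason(question, relevant, history)   # Shot Reasoning
        confidence = reflect(question, answer, history)  # Shot Reflection
        history.append((answer, confidence))
        if confidence >= conf_threshold:
            break  # confident enough: terminate and output the answer
        # Otherwise look closer: split the relevant shots into subshots.
        shots = [sub for shot in relevant for sub in partition(shot)]
    return answer, history


# Toy stand-ins: shots are (start, end) frame ranges and the "evidence"
# sits at frame 13. Confidence simply grows with each refinement round.
TARGET_FRAME = 13


def toy_select(question, shots):
    return [s for s in shots if s[0] <= TARGET_FRAME < s[1]]


def toy_partition(shot):
    mid = (shot[0] + shot[1]) // 2
    return [(shot[0], mid), (mid, shot[1])]


def toy_reason(question, relevant, history):
    start, end = relevant[0]
    return f"event in frames {start}-{end}"


def toy_reflect(question, answer, history):
    return min(1.0, 0.4 + 0.25 * len(history))


answer, history = chain_of_shot(
    "When does the event happen?", [(0, 32)],
    toy_select, toy_partition, toy_reason, toy_reflect,
)
print(answer)        # event in frames 8-16
print(len(history))  # 3
```

The loop stops on whichever comes first, the confidence threshold or the step cap, which mirrors the caption's statement that both the confidence score and the number of reasoning iterations decide termination.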