VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu

TL;DR

VideoEspresso addresses the scarcity of large-scale, fine-grained video reasoning data by automating the annotation of QA pairs and multimodal chain-of-thought steps through semantic-aware frame pruning and GPT-4o generation. It couples this dataset with a Hybrid LVLMs Collaboration framework that uses a tiny Frame Selector and a two-stage fine-tuned reasoning LVLM to achieve efficient, grounded VideoQA. On a benchmark of 14 tasks, compared against 9 LVLM baselines, the approach delivers state-of-the-art performance on most tasks along with notable efficiency gains, supporting the value of core-frame selection and visual CoT in video understanding. The work provides a scalable blueprint for building and leveraging high-quality VideoQA data to advance multimodal reasoning in video domains.

Abstract

The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities. Our code and dataset will be released at: https://github.com/hshjerry/VideoEspresso
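The abstract does not spell out how the semantic-aware redundancy reduction works, so the following is only a minimal sketch of the general idea, assuming frames are compared through their captions. The function names, the bag-of-words similarity (a crude stand-in for a learned text embedding), and the 0.8 threshold are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def bag_of_words(caption: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a learned text embedding."""
    return Counter(re.findall(r"[a-z0-9]+", caption.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_redundant_frames(captions: list[str], threshold: float = 0.8) -> list[int]:
    """Keep a frame only if its caption differs enough from the last kept frame.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    kept: list[int] = []
    last_vec = None
    for i, caption in enumerate(captions):
        vec = bag_of_words(caption)
        if last_vec is None or cosine(vec, last_vec) < threshold:
            kept.append(i)
            last_vec = vec
    return kept


# Hypothetical per-frame captions for one short clip.
captions = [
    "a man opens the fridge",
    "a man opens the fridge door",
    "a man pours milk into a glass",
    "a man pours milk into a glass",
    "a cat jumps onto the table",
]
core_frames = prune_redundant_frames(captions)
print(core_frames)
```

In this toy clip, near-duplicate captions collapse onto the first frame of each distinct event, so only the surviving frames would be passed on to QA generation. A real pipeline would presumably replace the bag-of-words vectors with a sentence or vision-language embedding.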

Paper Structure

This paper contains 23 sections, 4 equations, 14 figures, 9 tables, 1 algorithm.

Figures (14)

  • Figure 1: Overview of VideoEspresso. (a) Comparison of annotation pipelines: Unlike traditional videoQA datasets, VideoEspresso features an automatic pipeline for constructing complex reasoning QA tasks and multimodal Chain-of-Thought (CoT) annotations. This enhances the diversity of QA data and significantly improves scalability. (b) Examples from VideoEspresso: Illustrated are sample question-answer pairs, along with CoT bounding boxes and evidence annotations, demonstrating the dataset's richness. (c) Benchmark performance: Comparative results on our benchmark highlight the video reasoning capabilities of our model.
  • Figure 2: The automatic generation pipeline of VideoEspresso. (i) Question-Answer Pair Construction: We use frame-level video captions to extract the key frames of the video and group the descriptions of these frames. Then, we prompt GPT-4 to design questions for each group of video frames. (ii) Multimodal Chain-of-Thought Annotation: We use GPT-4o to extract key evidence text and generate the captions most relevant to the question. Additionally, we annotate spatial and temporal information for key items, which results in multimodal Chain-of-Thought data pairs grounded in both temporal and spatial dimensions.
  • Figure 3: The statistical analysis of our VideoEspresso dataset.
  • Figure 4: The dataset attributes comparison between our VideoEspresso and MVbench.
  • Figure 5: Two-Stage Video Evidence of Thought Training Procedure. The Frame Selector comprises a tiny LVLM and a tiny LLM, tasked with generating captions for videos and selecting the most relevant frames as core video tokens for the large reasoning model. A two-stage supervised fine-tuning technique is employed. In stage 1, a set of cue prompts is introduced to guide the model in producing evidence; in stage 2, the evidence generated in stage 1 is concatenated and used directly to guide answer generation.
  • ...and 9 more figures