StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA

Yuhang Hu, Zhenyu Yang, Shihan Wang, Shengsheng Qian, Bin Wen, Fan Yang, Tingting Gao, Changsheng Xu

TL;DR

StreamingCoT addresses the limitations of static annotations and opaque reasoning in VideoQA by introducing a dynamic, temporally hierarchical annotation framework and a multimodal Chain-of-Thought generation pipeline. It replaces single-clip labels with per-second dense captions fused into semantic segments, constructs dynamically evolving QA pairs, and synthesizes spatiotemporally grounded CoT traces via keyframe alignment and visual grounding, all validated by human experts. The dataset comprises 5,000 videos, 243,185 time-aligned captions yielding 68,940 semantic segments, 34,470 dynamic QA pairs, and 68,940 CoT annotations with 206,820 bounding boxes, spanning 32 thematic categories. By enabling explicit, auditable reasoning paths in streaming video, StreamingCoT provides a new benchmark and toolkit for temporal multimodal understanding with strong potential for improving interpretability and reasoning fidelity in real-world streaming scenarios.
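The counts above imply a few simple ratios that help in reading the dataset's scale. The short sketch below only does arithmetic on the figures quoted in the TL;DR. The dictionary keys are illustrative names, not fields from the released dataset, and reading "captions per video" as an approximate captioned duration in seconds assumes exactly one caption per second, which the summary suggests but does not state outright.

```python
# Counts quoted in the TL;DR; the key names are illustrative, not dataset fields.
stats = {
    "videos": 5_000,
    "captions": 243_185,
    "segments": 68_940,
    "qa_pairs": 34_470,
    "cot_annotations": 68_940,
    "bounding_boxes": 206_820,
}


def ratio(numerator: str, denominator: str) -> float:
    """Return the ratio of two quoted counts."""
    return stats[numerator] / stats[denominator]


if __name__ == "__main__":
    pairs = [
        ("captions", "videos"),           # about seconds of captioned video per clip, if one caption per second
        ("segments", "videos"),           # semantic segments per video
        ("captions", "segments"),         # per-second captions fused into each segment
        ("cot_annotations", "qa_pairs"),  # CoT annotations per QA pair
        ("bounding_boxes", "cot_annotations"),  # boxes per CoT annotation
    ]
    for num, den in pairs:
        print(f"{num} per {den}: {ratio(num, den):.2f}")
```

Two of these ratios come out as exact integers (two CoT annotations per QA pair and three bounding boxes per CoT annotation). Whether that reflects a fixed annotation template is not stated in this summary.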

Abstract

The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) the absence of explicit reasoning process annotations restricts model interpretability and logical deduction capabilities. To address these challenges, we introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally dependent semantic segments through similarity fusion, paired with question-answer sets constrained by temporal evolution patterns. We further propose an explicit reasoning chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, derives reasoning paths based on object state transitions using large language models, and ensures logical coherence through human verification. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference. StreamingCoT and its construction toolkit are available at https://github.com/Fleeting-hyh/StreamingCoT.
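To make the idea of fusing per-second captions into semantic segments concrete, here is a minimal sketch. It is not the paper's fusion procedure: the abstract does not specify the similarity measure or merge rule, so this example assumes a token-level Jaccard similarity between adjacent captions and a fixed threshold. The function names `jaccard` and `fuse_captions` and the `threshold` parameter are hypothetical.

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a stand-in for whatever similarity the paper uses."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty captions are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def fuse_captions(
    captions: List[str], threshold: float = 0.5
) -> List[Tuple[int, int, List[str]]]:
    """Group consecutive per-second captions into segments.

    A new segment starts whenever the similarity between a caption and its
    predecessor drops below `threshold` (an assumed, hypothetical rule).
    Returns (start_second, end_second, captions_in_segment) tuples.
    """
    segments: List[Tuple[int, int, List[str]]] = []
    if not captions:
        return segments
    start = 0
    for t in range(1, len(captions)):
        if jaccard(captions[t - 1], captions[t]) < threshold:
            segments.append((start, t - 1, captions[start:t]))
            start = t
    segments.append((start, len(captions) - 1, captions[start:]))
    return segments


if __name__ == "__main__":
    demo = [
        "a man opens the door",
        "a man opens the door slowly",
        "a dog runs on grass",
        "a dog runs on the grass",
    ]
    for start, end, caps in fuse_captions(demo):
        print(f"[{start}s-{end}s] {caps[0]} ... ({len(caps)} captions)")
```

In this toy input the first two captions share most of their tokens and are merged, while the scene change at second 2 opens a new segment. A real pipeline would more plausibly compare sentence embeddings than raw tokens, but that choice is likewise an assumption here.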

Paper Structure

This paper contains 26 sections, 11 equations, 3 figures.

Figures (3)

  • Figure 1: StreamingCoT Pipeline: Illustrates the hierarchical framework for dataset construction, comprising: (1) Geographically balanced video collection with multimodal filtering; (2) Adaptive temporal segmentation via Dynamic Semantic Fusion (DSF) and context-aware dense captioning; (3) Dynamic QA pair generation constrained by temporal evolution patterns, featuring distractor-aware option design; (4) Multimodal Chain-of-Thought synthesis integrating temporally verified reasoning, key object grounding, and spatiotemporal evidence fusion; (5) Iterative human validation ensuring spatiotemporal consistency and reasoning integrity throughout the workflow.
  • Figure 2: Dataset Statistics
  • Figure 3: Examples of question types and temporal evidence accumulation in the StreamingCoT dataset