CogStream: Context-guided Streaming Video Question Answering

Zicheng Zhao, Kangyu Wang, Shijie Li, Rui Qian, Weiyao Lin, Huabin Liu

TL;DR

CogStream tackles streaming video reasoning by requiring models to selectively leverage relevant historical context for each current question. It introduces a densely annotated, hierarchical QA dataset and a baseline, CogReasoner, that jointly compresses visual streams and retrieves pertinent dialogue for reasoning. The framework combines Temporal-Semantic Clustering for event-level compression with a historic dialogue retrieval module and a video-text interleaving reasoning step, showing robustness to noisy long-range context. The results indicate significant gains over existing streaming VQA baselines and demonstrate practical benefits in efficiency and accuracy for long-span streaming QA tasks.

Abstract

Although Video Large Language Models (Vid-LLMs) have improved multimodal understanding, streaming video reasoning remains challenging because it relies on contextual information. Existing paradigms feed all available historical context into Vid-LLMs, which imposes a significant computational burden for visual data processing. Furthermore, irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios and requires models to identify the most relevant historical context to deduce answers to questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline. Additionally, we present CogReasoner as a baseline model, which tackles this task through visual stream compression and historical dialogue retrieval. Extensive experiments demonstrate the effectiveness of this method.
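
To make the idea of event-level visual stream compression more concrete, the following is a minimal sketch of grouping consecutive frames into events by embedding similarity. It is not the paper's Temporal-Semantic Clustering, whose details are not given in this summary; the greedy thresholding rule, the function names `cosine` and `cluster_stream`, the `threshold` parameter, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster_stream(frame_embeddings, threshold=0.8):
    """Group consecutive frames into events (lists of frame indices).

    Assumed rule: a new event starts whenever a frame's similarity to
    the running mean embedding of the current event falls below
    `threshold`. Events are therefore always temporally contiguous.
    """
    events = []
    centroid = None
    for idx, emb in enumerate(frame_embeddings):
        if centroid is not None and cosine(emb, centroid) >= threshold:
            events[-1].append(idx)
            n = len(events[-1])
            # Update the running mean of the current event.
            centroid = [(c * (n - 1) + e) / n for c, e in zip(centroid, emb)]
        else:
            events.append([idx])
            centroid = list(emb)
    return events

# Toy embeddings: two frames pointing one way, then two pointing another.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
events = cluster_stream(frames)
```

In a setting like the one the abstract describes, each resulting event could then be compressed more or less aggressively depending on its relevance to the current question; that step is not sketched here.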

Paper Structure

This paper contains 38 sections, 6 equations, 20 figures, 10 tables, 1 algorithm.

Figures (20)

  • Figure 1: Illustration of CogStream. Given a streaming video, users continuously interact with models by asking questions. Both the video data and the history of QA dialogue grow with the stream. To answer the latest question, models must deduce the answer from relevant historical context, thereby forming the dialogue stream. Our CogReasoner addresses this task by compressing streaming video based on current questions and accurately retrieving relevant historical QAs to deduce the answer.
  • Figure 2: Illustration of different QA settings and type distribution in the dataset. Top: Streaming QA. Bottom-left: Basic QA (left) and Global QA (right). Bottom-right: Distribution of QA types.
  • Figure 3: The generation pipeline of the CogStream dataset.
  • Figure 4: Overview of CogReasoner. It comprises three modules: the Visual Stream Compression module uses Temporal-Semantic Clustering and Question-aware Streaming Compression to process video streams into relevant events; the Historic Dialogue Retrieval module employs an LLM to select relevant historical QA pairs and assess whether visual input is necessary; the Video-text Interleave Reasoning module interleaves visual and textual tokens time-sequentially for answer generation.
  • Figure A.1: Distribution of the raw video sources.
  • ...and 15 more figures
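
The time-sequential interleaving mentioned in the Figure 4 caption can be illustrated with a small sketch. It only shows the ordering idea, assuming each compressed video event and each retrieved QA pair carries a timestamp; the `Segment` class, its fields, and the `interleave` function are hypothetical names, and the strings stand in for actual visual or textual tokens.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    time: float    # timestamp in seconds (assumed to be available)
    kind: str      # "video" or "text"
    content: str   # placeholder for visual tokens or QA text

def interleave(video_events, qa_pairs):
    """Merge video and text segments into one time-ordered sequence.

    sorted() is stable, so when timestamps tie, video segments stay
    ahead of text segments because they are listed first.
    """
    return sorted(video_events + qa_pairs, key=lambda s: s.time)

video = [Segment(0.0, "video", "event-1"), Segment(30.0, "video", "event-2")]
dialogue = [Segment(12.5, "text", "Q1/A1"), Segment(30.0, "text", "Q2/A2")]
sequence = [s.content for s in interleave(video, dialogue)]
```

The resulting sequence places each historical QA pair next to the video event it followed in the stream, which is one plausible reading of the interleaved input described in the caption.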