Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos

Henghui Du, Chunjie Zhang, Xi Chen, Chang Zhou, Di Hu

TL;DR

The paper tackles long video question answering by introducing VideoDetective, a model that compresses visual tokens with a small set of learnable memory tokens and recurrently aggregates memory across sub-segments to maintain history context. It integrates an efficient question-aware memory mechanism into a standard visual encoder–LLM pipeline, enabling processing of hour-long videos with limited context length and modest GPU memory. To evaluate true long-context understanding, the authors present GLVC, a dataset that grounds concrete clues and timestamps throughout entire videos. Experiments across multiple benchmarks show competitive or superior performance with significantly improved memory efficiency, demonstrating the method's potential for hours-long video understanding tasks.

Abstract

Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) because of the immense context and information overload, which can also lead to prohibitive memory consumption. Existing methods attempt to address these issues by reducing visual tokens or extending the model's context length, but they may miss useful information or require considerable computation. In fact, answering a given question requires only a small amount of crucial information. We therefore propose an efficient question-aware memory mechanism that enables MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies the task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy introduces a few special memory tokens to achieve purposeful compression, which allows the model to seek critical clues effectively while reducing visual tokens. Because history context can have a significant impact, we then recurrently aggregate and store these memory tokens to update the history context, which is reused for subsequent sub-segments. Furthermore, to measure a model's long video understanding ability more effectively, we introduce GLVC (Grounding Long Video Clues), a long video question-answering dataset that features grounding of critical and concrete clues scattered throughout entire videos. Experimental results demonstrate that our method enables MLLMs with a limited context length of 32K to efficiently process 100K tokens (3600 frames, an hour-long video sampled at 1fps), requiring only 2 minutes and 37GB of GPU memory. Evaluation results across multiple long video benchmarks show that our method seeks critical clues from massive information more effectively.

Paper Structure

This paper contains 23 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: We present VideoDetective, a Multi-modal Large Language Model equipped with an efficient question-aware memory mechanism. As shown in Fig. 1(a), it recurrently seeks critical clues related to the question from minutes-long or even hours-long videos. Only a few special tokens, rather than the entire video, are used to answer the question, which saves GPU memory while still leveraging crucial information. Fig. 1(b, c) shows the Visual Needle-In-The-Haystack evaluation (zhang2024long) and inference efficiency. Our proposed efficient question-aware memory mechanism enables models with limited context length, such as Qwen2.5-VL, to efficiently process an input of $4K$ video frames, requiring only $2$ minutes and $37$GB of GPU memory. Moreover, compared to other long video understanding models, VideoDetective can more effectively seek critical "needles" from the video haystack, demonstrating superior long video understanding capabilities.
  • Figure 2: (a) The architecture of our VideoDetective model. The video is divided into multiple sub-segments, which are processed by the visual encoder to obtain multi-modal embeddings. These embeddings are separated by special <split> tokens, and the model processes each sub-segment recurrently. (b) The question-aware memory mechanism in the attention module at every transformer layer of the LLM. When each sub-segment is processed, only a few special memory tokens <memory> are appended at the end of the sub-segment sequence. The memory tokens from all past sub-segments and the current sub-segment then perform attention calculations with the other tokens, after which all memory tokens are aggregated and stored as historical information for subsequent sub-segments.
  • Figure 3: The training and inference process.
  • Figure 4: The overview of GLVC (Grounding Long Video Clues) dataset.
  • Figure 5: The impact of compression ratio $\alpha$.
  • ...and 1 more figures
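
The sequence layout described in the Figure 2 caption can be sketched as follows. This is a toy illustration on string tokens, assuming that each step sees the memory slots stored from earlier sub-segments plus the current sub-segment with <memory> tokens appended. The function names, the `mem[i][j]` slot labels, and the choice of two memory tokens are hypothetical; the actual model operates on embeddings and attention states rather than strings.

```python
SPLIT = "<split>"
MEMORY = "<memory>"


def split_subsegments(tokens):
    """Split a flat token list on <split> markers, dropping empty pieces."""
    segments, current = [], []
    for tok in tokens:
        if tok == SPLIT:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        segments.append(current)
    return segments


def recurrent_steps(tokens, num_memory=2):
    """Yield (visible_history, step_input) for each sub-segment.

    visible_history: labels of memory slots stored from earlier sub-segments.
    step_input: current sub-segment tokens followed by fresh <memory> tokens.
    """
    history = []
    for idx, segment in enumerate(split_subsegments(tokens)):
        step_input = segment + [MEMORY] * num_memory
        yield list(history), step_input
        # After the step, only the memory slots are kept as history;
        # the raw sub-segment tokens are not carried forward.
        history += [f"mem[{idx}][{j}]" for j in range(num_memory)]


frames = ["f0", "f1", SPLIT, "f2", "f3", SPLIT, "f4"]
for visible, step_input in recurrent_steps(frames):
    print(visible, step_input)
```

Each printed line shows what one step can see: the history grows by `num_memory` slots per sub-segment, while the raw frame tokens of earlier sub-segments never reappear.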