Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos

Henghui Du; Chunjie Zhang; Xi Chen; Chang Zhou; Di Hu

TL;DR

The paper tackles long video question answering by introducing VideoDetective, a model that compresses visual tokens with a small set of learnable memory tokens and recurrently aggregates memory across sub-segments to maintain history context. It integrates an efficient question-aware memory mechanism into a standard visual encoder–LLM pipeline, enabling processing of hour-long videos with limited context length and modest GPU memory. To evaluate true long-context understanding, the authors present GLVC, a dataset that grounds concrete clues and timestamps throughout entire videos. Experiments across multiple benchmarks show competitive or superior performance with significantly improved memory efficiency, demonstrating the method's potential for hours-long video understanding tasks.

Abstract

Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) because of the immense context and information overload, which can also lead to prohibitive memory consumption. Existing methods attempt to address these issues by reducing visual tokens or extending the model's context length, but they may miss useful information or require considerable computation. In fact, only a small amount of crucial information is needed to answer a given question. We therefore propose an efficient question-aware memory mechanism that enables MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies the task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy introduces a few special memory tokens to achieve purposeful compression. This allows models to seek critical clues effectively while reducing visual tokens. Then, because history context can have a significant impact, we recurrently aggregate and store these memory tokens to update the history context, which is reused for subsequent sub-segments. Furthermore, to measure a model's long video understanding ability more effectively, we introduce GLVC (Grounding Long Video Clues), a long video question-answering dataset that features grounding of critical and concrete clues scattered throughout entire videos. Experimental results demonstrate that our method enables MLLMs with a limited context length of 32K to efficiently process 100K tokens (3600 frames, an hour-long video sampled at 1fps), requiring only 2 minutes and 37GB of GPU memory. Evaluation results across multiple long video benchmarks show that our method can more effectively seek critical clues from massive information.
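
The recurrent loop described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: in VideoDetective the compression is performed by learnable memory tokens inside the MLLM, whereas here a hypothetical stand-in, top_k_by_question, simply keeps the k segment tokens with the largest dot product against a question vector. All names, the vector representation, and the parameter values are assumptions made for illustration. The sketch shows only the control flow: each sub-segment is compressed to a few tokens, those tokens are appended to a running memory, and the context seen at any step is the memory plus one sub-segment rather than the whole video.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# A compressor maps (history memory, sub-segment tokens, question, k) to k tokens.
Compressor = Callable[[List[Vector], Sequence[Vector], Vector, int], List[Vector]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k_by_question(memory: List[Vector], segment: Sequence[Vector],
                      question: Vector, k: int) -> List[Vector]:
    # Hypothetical stand-in for question-aware compression: keep the k tokens
    # most aligned with the question. This toy version ignores `memory`; a real
    # model would attend to the history context as well.
    ranked = sorted(segment, key=lambda tok: dot(tok, question), reverse=True)
    return list(ranked[:k])


def recurrent_compress(visual_tokens: Sequence[Vector], question: Vector,
                       segment_size: int, k: int,
                       compress: Compressor = top_k_by_question
                       ) -> Tuple[List[Vector], int]:
    memory: List[Vector] = []   # aggregated memory tokens (history context)
    peak_context = 0            # largest context seen at any single step
    for start in range(0, len(visual_tokens), segment_size):
        segment = visual_tokens[start:start + segment_size]
        peak_context = max(peak_context, len(memory) + len(segment))
        # Compress the sub-segment and store the result for later sub-segments.
        memory.extend(compress(memory, segment, question, k))
    return memory, peak_context


# Toy input: 10 two-dimensional "visual tokens" and a question vector.
tokens = [[float(i), 1.0] for i in range(10)]
question = [1.0, 0.0]
memory, peak = recurrent_compress(tokens, question, segment_size=4, k=2)
print(len(memory), peak)
```

In this toy run the 10 input tokens are reduced to 6 memory tokens, and no step sees more than 6 tokens at once. The same bookkeeping is what would let a model with a bounded context length work through a token stream much longer than that length.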
