Progressive Video Condensation with MLLM Agent for Long-form Video Understanding

Yufei Yin, Yuchen Xing, Qianke Meng, Minghao Chen, Yan Yang, Zhou Yu

Abstract

Understanding long videos requires extracting query-relevant information from long sequences under tight compute budgets. Existing text-then-LLM pipelines lose fine-grained visual cues, while video-based multimodal large language models (MLLMs) preserve visual detail but are frame-hungry and computationally expensive. In this work, we harness MLLMs for efficient video understanding. We propose ProVCA, a progressive video condensation agent that iteratively locates keyframes at multiple granularities. ProVCA first applies a segment localization module to identify the video segment relevant to the query, then a snippet selection module to select important snippets based on their similarity to the query, and finally a keyframe refinement module to pinpoint specific keyframes within those snippets. By progressively narrowing the scope from coarse segments to fine frames, ProVCA identifies a small set of keyframes for MLLM-based reasoning. ProVCA achieves state-of-the-art zero-shot accuracies of 69.3% on EgoSchema, 80.5% on NExT-QA, and 77.7% on IntentQA, while using fewer frames than previous training-free methods.
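
To make the coarse-to-fine idea concrete, below is a minimal sketch of progressive condensation, assuming precomputed per-frame features and a single query embedding. Cosine similarity stands in for the MLLM-driven relevance judgments of the paper's modules, and all names and parameters here (condense, num_segments, top_snippets, etc.) are illustrative, not the authors' API.

```python
import numpy as np

def cosine_scores(frame_feats: np.ndarray, query_feat: np.ndarray) -> np.ndarray:
    """Per-frame cosine similarity to the query (a stand-in for the
    MLLM's relevance judgment in the actual pipeline)."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    return f @ q

def condense(frame_feats: np.ndarray, query_feat: np.ndarray,
             num_segments: int = 4, snippets_per_segment: int = 4,
             top_snippets: int = 2, frames_per_snippet: int = 2) -> list[int]:
    """Progressive coarse-to-fine keyframe selection in the spirit of ProVCA."""
    n = len(frame_feats)
    scores = cosine_scores(frame_feats, query_feat)

    # Stage 1 -- segment localization: keep the segment with the highest
    # mean relevance to the query.
    seg_edges = np.linspace(0, n, num_segments + 1, dtype=int)
    seg_means = [scores[a:b].mean() for a, b in zip(seg_edges[:-1], seg_edges[1:])]
    s = int(np.argmax(seg_means))
    lo, hi = seg_edges[s], seg_edges[s + 1]

    # Stage 2 -- snippet selection: split the chosen segment into snippets
    # and keep the top-k by mean score.
    snip_edges = np.linspace(lo, hi, snippets_per_segment + 1, dtype=int)
    snippets = list(zip(snip_edges[:-1], snip_edges[1:]))
    snippets.sort(key=lambda ab: scores[ab[0]:ab[1]].mean(), reverse=True)
    kept = snippets[:top_snippets]

    # Stage 3 -- keyframe refinement: take the highest-scoring frames
    # inside each kept snippet.
    keyframes = []
    for a, b in kept:
        local = np.argsort(scores[a:b])[::-1][:frames_per_snippet]
        keyframes.extend(int(a + i) for i in local)
    return sorted(keyframes)

# Example: 256 frames of 512-d features and one query embedding.
rng = np.random.default_rng(0)
frames = rng.normal(size=(256, 512))
query = rng.normal(size=512)
print(condense(frames, query))  # a handful of frame indices for the MLLM
```

In the paper's pipeline, the stages are driven by the MLLM agent rather than a fixed embedding score; the sketch collapses all three stages onto one scorer for brevity, keeping only the progressive narrowing from segments to snippets to frames.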

Figures (2)

  • Figure 1: Conceptual comparison of three LLM/MLLM-based video understanding approaches. Existing methods either (a) reason directly with pre-trained video-based MLLMs at high computational cost, or (b) use LLM agents that convert frames to captions and rely on VLMs (e.g., CLIP) to filter keyframes, discarding fine-grained visual cues. In contrast, our ProVCA (c) progressively condenses the video by using an MLLM to narrow query-relevant content from coarse segments to fine frames, then feeds the selected keyframes into the MLLM to generate the final answer.
  • Figure 2: Overview of ProVCA for MLLM-based video understanding.