DrVideo: Document Retrieval Based Long Video Understanding

Ziyu Ma, Chenhui Gou, Hengcan Shi, Bin Sun, Shutao Li, Hamid Rezatofighi, Jianfei Cai

TL;DR

DrVideo tackles long video understanding by transforming videos into long textual documents and applying a document retrieval and multi-stage agent loop to progressively augment information for LLM-based reasoning. By converting frame-level content into documents, retrieving key frames, and iteratively querying missing details, DrVideo achieves substantial improvements over prior LLM-based methods on EgoSchema, MovieChat-1K, and the long Video-MME split. The approach emphasizes zero-shot capability, modular augmentation, and explainability via chain-of-thought predictions, highlighting a practical path for scalable, long-range video reasoning. Limitations include token-length boundaries and the reliance on capable vision-language and language models, suggesting further gains with stronger VLM/LLM backbones.

Abstract

Most existing methods for video understanding focus on videos lasting only tens of seconds, with limited exploration of techniques for handling long videos. The increased number of frames in long videos poses two main challenges: locating key information and performing long-range reasoning. We therefore propose DrVideo, a document-retrieval-based system designed for long video understanding. Our key idea is to convert the long-video understanding problem into a long-document understanding task so as to effectively leverage the power of large language models. Specifically, DrVideo first transforms a long video into a coarse text-based long document, uses it to retrieve an initial set of key frames, and then updates the document with augmented key-frame information. It then employs an agent-based iterative loop to continuously search for missing information and augment the document until sufficient question-related information is gathered for making the final prediction in a chain-of-thought manner. Extensive experiments on long video benchmarks confirm the effectiveness of our method. DrVideo significantly outperforms existing LLM-based state-of-the-art methods on the EgoSchema benchmark (3 minutes), the MovieChat-1K benchmark (10 minutes), and the long split of the Video-MME benchmark (average of 44 minutes).
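
The pipeline described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names FrameDoc, retrieve_top_k, and drvideo_sketch are hypothetical, the captioner, planner, and answerer callables stand in for the VLM and LLM agents, and bag-of-words cosine similarity is a simple substitute for whatever retrieval model the paper actually uses.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class FrameDoc:
    """Hypothetical container: one text entry of the video 'document'."""
    index: int  # assumed to equal the entry's position in the list
    text: str   # coarse caption, later extended with augmented details


def _bag(text: str) -> Counter:
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_top_k(docs: List[FrameDoc], question: str, k: int) -> List[int]:
    """Return the indices of the k entries most similar to the question.

    Assumption: bag-of-words cosine replaces a learned text embedding.
    """
    q = _bag(question)
    ranked = sorted(docs, key=lambda d: _cosine(_bag(d.text), q), reverse=True)
    return sorted(d.index for d in ranked[:k])


# Stand-ins for the models (all hypothetical signatures):
Captioner = Callable[[int, str], str]  # (frame index, query) -> extra detail
Planner = Callable[[List[FrameDoc], str], Tuple[bool, List[Tuple[int, str]]]]
Answerer = Callable[[List[FrameDoc], str], str]


def drvideo_sketch(docs: List[FrameDoc], question: str, captioner: Captioner,
                   planner: Planner, answerer: Answerer,
                   k: int = 2, max_rounds: int = 3) -> str:
    # 1) Retrieve initial key frames and augment their document entries.
    for i in retrieve_top_k(docs, question, k):
        docs[i].text += " " + captioner(i, question)
    # 2) Iterate: the planner decides whether the document is sufficient;
    #    if not, it names frames and sub-questions to augment next.
    for _ in range(max_rounds):
        sufficient, requests = planner(docs, question)
        if sufficient:
            break
        for i, sub_question in requests:
            docs[i].text += " " + captioner(i, sub_question)
    # 3) Answer from the augmented document.
    return answerer(docs, question)
```

The loop is bounded by max_rounds so that it terminates even if the planner never reports that the document is sufficient; that bound is a design choice of this sketch.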

Paper Structure

This paper contains 19 sections, 4 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of our DrVideo framework. It comprises five components: a video-document conversion module, a retrieval module, a document augmentation module, a multi-stage agent interaction loop and an answering module.
  • Figure 2: Illustration of the multi-stage agent interaction loop and answering module. There are two agents in the multi-stage agent interaction loop: a planning agent to plan the next step and an interaction agent to dynamically find missing information and interact with the document augmentation module.
  • Figure 3: Performance across rounds for DrVideo and VideoAgent [wang2024videoagent] on EgoSchema. To align with VideoAgent, GPT-4 (gpt-4-turbo-1106-preview) is used for the LLM agents.
  • Figure 4: Case study on an instance from EgoSchema. DrVideo accurately identifies key frames and chooses the correct answer.
  • Figure 5: Case study on a long instance from Video-MME. The video is 33 minutes long.
  • ...and 1 more figures