DrVideo: Document Retrieval Based Long Video Understanding
Ziyu Ma, Chenhui Gou, Hengcan Shi, Bin Sun, Shutao Li, Hamid Rezatofighi, Jianfei Cai
TL;DR
DrVideo tackles long video understanding by transforming videos into long textual documents and applying document retrieval and a multi-stage agent loop to progressively augment the information available for LLM-based reasoning. By converting frame-level content into documents, retrieving key frames, and iteratively querying missing details, DrVideo achieves substantial improvements over prior LLM-based methods on EgoSchema, MovieChat-1K, and the long Video-MME split. The approach emphasizes zero-shot capability, modular augmentation, and explainability via chain-of-thought predictions, highlighting a practical path toward scalable, long-range video reasoning. Limitations include token-length limits and reliance on capable vision-language and language models, suggesting further gains with stronger VLM/LLM backbones.
Abstract
Most existing video understanding methods focus on videos lasting only tens of seconds, and techniques for handling long videos remain underexplored. The increased number of frames in long videos poses two main challenges: locating key information and performing long-range reasoning. We therefore propose DrVideo, a document-retrieval-based system designed for long video understanding. Our key idea is to convert the long-video understanding problem into a long-document understanding task so as to effectively leverage the power of large language models. Specifically, DrVideo first transforms a long video into a coarse text-based long document, uses it to retrieve an initial set of key frames, and then updates the document with augmented information about those key frames. It then employs an agent-based iterative loop that continuously searches for missing information and augments the document until sufficient question-related information has been gathered, at which point it makes the final prediction in a chain-of-thought manner. Extensive experiments on long video benchmarks confirm the effectiveness of our method. DrVideo significantly outperforms existing LLM-based state-of-the-art methods on the EgoSchema benchmark (3 minutes), the MovieChat-1K benchmark (10 minutes), and the long split of the Video-MME benchmark (44 minutes on average).
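The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every function and parameter name below (`retrieve_top_k`, `answer_long_video`, `caption`, `augment`, `find_missing`, `predict`, `k`, `max_rounds`) is hypothetical, the vision-language and language models are replaced by stub callables, and retrieval uses a toy word-overlap score in place of whatever similarity measure the real system uses.

```python
from typing import Callable, Dict, List


def overlap(query: str, text: str) -> int:
    """Toy relevance score: count of shared lowercase words (assumed stand-in for real retrieval)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_top_k(document: Dict[int, str], question: str, k: int) -> List[int]:
    """Return the k frame indices whose text best matches the question (ties go to earlier frames)."""
    ranked = sorted(document, key=lambda i: (-overlap(question, document[i]), i))
    return ranked[:k]


def answer_long_video(
    frames: List[str],
    question: str,
    caption: Callable[[str], str],                            # stub for a VLM captioner
    augment: Callable[[str, str], str],                       # stub for question-aware frame description
    find_missing: Callable[[Dict[int, str], str], List[int]], # stub for the agent that requests more frames
    predict: Callable[[Dict[int, str], str], str],            # stub for the final LLM prediction
    k: int = 2,
    max_rounds: int = 3,
) -> str:
    # Step 1: turn the video into a coarse text document, one entry per frame.
    document = {i: caption(frame) for i, frame in enumerate(frames)}

    # Step 2: retrieve initial key frames and augment their entries.
    augmented = set()
    for i in retrieve_top_k(document, question, k):
        document[i] += " " + augment(frames[i], question)
        augmented.add(i)

    # Step 3: iterate until the agent reports nothing missing or the round budget runs out.
    # Skipping already-augmented frames is an assumption made here to guarantee progress.
    for _ in range(max_rounds):
        missing = [i for i in find_missing(document, question) if i not in augmented]
        if not missing:
            break
        for i in missing:
            document[i] += " " + augment(frames[i], question)
            augmented.add(i)

    # Step 4: make the final prediction from the augmented document.
    return predict(document, question)


# Tiny demo with hand-written stubs in place of real models.
frames = ["frame0", "frame1", "frame2"]
captions = {
    "frame0": "a person opens the fridge",
    "frame1": "a person chops onions",
    "frame2": "a person washes dishes",
}
question = "who chops onions"
calls: List[str] = []  # records which frames were augmented, in order


def stub_augment(frame: str, q: str) -> str:
    calls.append(frame)
    return f"[detail for {frame}]"


answer = answer_long_video(
    frames,
    question,
    caption=lambda f: captions[f],
    augment=stub_augment,
    find_missing=lambda doc, q: [2] if "[detail" not in doc[2] else [],
    predict=lambda doc, q: " | ".join(doc[i] for i in sorted(doc)),
    k=1,
)
print(answer)
```

In this demo the stub agent asks for frame 2 once, so the loop runs one augmentation round and stops on the second. In a real system the four callables would wrap model calls, and the stopping decision would come from the LLM agent rather than a fixed rule.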
