OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer

Lu Zhang; Tiancheng Zhao; Heting Ying; Yibo Ma; Kyusong Lee

TL;DR

OmAgent tackles the problem of understanding ultra-long videos by combining multimodal retrieval-augmented generation with an autonomous Divide-and-Conquer Loop. It introduces Video2RAG preprocessing to store rich, timestamped scene content and a rewinder-enabled DnC Loop that decomposes tasks, invokes tools, and revisits video segments as needed. A new long-form video benchmark with 2000+ Q&A pairs demonstrates that OmAgent outperforms strong baselines, highlighting improved reasoning, localization, and information synthesis. The approach reduces information loss inherent in frame-based or text-only representations and offers scalable, agent-driven video understanding across diverse content types.

Abstract

Recent advancements in Large Language Models (LLMs) have expanded their capabilities to multimodal contexts, including comprehensive video understanding. However, processing extensive videos such as 24-hour CCTV footage or full-length films presents significant challenges due to the vast data and processing demands. Traditional methods, like extracting key frames or converting frames to text, often result in substantial information loss. To address these shortcomings, we develop OmAgent, which efficiently stores and retrieves relevant video frames for specific queries, preserving the detailed content of videos. Additionally, it features a Divide-and-Conquer Loop capable of autonomous reasoning, dynamically invoking APIs and tools to enhance query processing and accuracy. This approach ensures robust video understanding, significantly reducing information loss. Experimental results affirm OmAgent's efficacy in handling various types of videos and complex tasks. Moreover, we have endowed it with greater autonomy and a robust tool-calling system, enabling it to accomplish even more intricate tasks.

Paper Structure

This paper contains 34 sections, 4 figures, 6 tables, and 1 algorithm.

Introduction
Related Work
Video LLMs
Long Video Understanding system with LLMs
MultiModal RAG
Method
Video to RAG
Visual Prompting
Text Representation of Audio
Scene Caption
Encode and Save
Divide-and-Conquer Loop
Conqueror
Divider
Rescuer
...and 19 more sections

Figures (4)

Figure 1: How OmAgent understands video. In Video2RAG, the video is processed by different algorithms (e.g. scene detection, ASR, and face recognition) and then summarized by MLLMs to generate Scene Captions. Those captions are encoded and saved in the knowledge database. When OmAgent receives a query, it filters and retrieves from the knowledge database based on timestamps (if available). The retrieved information is processed by the Divide-and-Conquer Loop and summarized by Conclusive Synthesis to generate the final answer.
Figure 2: Divider and Conqueror Loop task-solving procedure. In the DnC Loop, simple problems are directly executed by the Conqueror, while complex problems are split by the Divider until they can be executed. The Rescuer recognizes exceptions and retries the task. The Tool Manager organizes the external tools. Notably, the Rewinder tool can go back through the entire video to find information and missing details. Finally, the DnC Loop outputs the relevant content whether the execution fails or succeeds.
Figure 3: case 1
Figure 4: case 2
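
The control flow described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not OmAgent's implementation: the names `Task`, `Result`, `dnc_loop`, and the `toy_*` helpers are hypothetical, the Conqueror, Divider, and simplicity check are passed in as plain callables rather than LLM calls, the Rescuer is reduced to a bounded retry, and the Tool Manager and Rewinder are omitted. The depth limit is also an assumption, added so the recursion always terminates.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    description: str
    depth: int = 0  # how many times this task has been split so far


@dataclass
class Result:
    task: str
    success: bool
    output: str  # answer on success, error message on failure


def dnc_loop(
    task: Task,
    is_simple: Callable[[Task], bool],
    conquer: Callable[[Task], str],
    divide: Callable[[Task], List[str]],
    max_depth: int = 3,    # assumption: cap on recursive splitting
    max_retries: int = 1,  # assumption: Rescuer retries this many times
) -> List[Result]:
    """Solve a task recursively and report every leaf, failed or not."""
    if is_simple(task) or task.depth >= max_depth:
        # Conqueror role: execute directly.
        # Rescuer role: catch the exception and retry the same task.
        error = ""
        for _ in range(max_retries + 1):
            try:
                return [Result(task.description, True, conquer(task))]
            except Exception as exc:
                error = str(exc)
        return [Result(task.description, False, error)]

    # Divider role: split a complex task and recurse on each subtask.
    results: List[Result] = []
    for sub in divide(task):
        child = Task(sub, task.depth + 1)
        results.extend(
            dnc_loop(child, is_simple, conquer, divide, max_depth, max_retries)
        )
    return results


# Toy stand-ins for the LLM-backed components (hypothetical).
def toy_is_simple(task: Task) -> bool:
    return " and " not in task.description


def toy_divide(task: Task) -> List[str]:
    return task.description.split(" and ")


def toy_conquer(task: Task) -> str:
    if "missing" in task.description:
        raise LookupError("no matching scene")
    return f"answered: {task.description}"


results = dnc_loop(
    Task("find the red car and find the missing bag"),
    toy_is_simple,
    toy_conquer,
    toy_divide,
)
for r in results:
    print(r)
```

In this toy run the query is split into two subtasks: the first succeeds, and the second fails after one retry. Both appear in the returned list, mirroring the caption's point that the loop outputs the relevant content whether execution fails or succeeds.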
