Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding

Haiyang Yan, Hongyun Zhou, Peng Xu, Xiaoxue Feng, Mengyi Liu

Abstract

Despite the rapid development and widespread application of MLLM agents, they still struggle with long-form video understanding (LVU) tasks, which are characterized by high information density and extended temporal spans. Recent research on LVU agents demonstrates that simple task decomposition and collaboration mechanisms are insufficient for long-chain reasoning tasks. Moreover, directly reducing the temporal context through embedding-based retrieval may discard information that is key to complex problems. In this paper, we propose Symphony, a multi-agent system, to alleviate these limitations. By emulating human cognitive patterns, Symphony decomposes LVU into fine-grained subtasks and incorporates a deep reasoning collaboration mechanism enhanced by reflection, effectively improving reasoning capability. Additionally, Symphony provides a VLM-based grounding approach that analyzes LVU tasks and assesses the relevance of video segments, which significantly enhances its ability to locate relevant segments for complex problems with implicit intentions and large temporal spans. Experimental results show that Symphony achieves state-of-the-art performance on LVBench, LongVideoBench, VideoMME, and MLVU, with a 5.0% improvement over the prior state-of-the-art method on LVBench. Code is available at https://github.com/Haiyang0226/Symphony.

Paper Structure

This paper contains 23 sections, 7 equations, 11 figures, and 8 tables.

Figures (11)

  • Figure 1: (a) Single-agent approaches face limitations in reasoning capacity when handling long-video understanding tasks requiring multi-step reasoning. (b) Our proposed multi-agent system achieves enhanced reasoning capabilities through task decomposition and collaboration along functional dimensions.
  • Figure 2: The reflection-enhanced dynamic reasoning framework in Symphony. The planning agent formulates a task plan and dynamically invokes other agents to execute subtasks. Upon obtaining an initial solution, the reflection agent evaluates the reasoning chain $\tau$, producing a critique $\mathcal{C}$ that guides a subsequent round of reasoning exploration.
  • Figure 3: The CLIP-based method retrieves with the original query and therefore fails to capture abstract concepts and actions that unfold over temporal sequences. Our grounding agent analyzes the query, expands and refines the relevant concepts, and uses a VLM to evaluate the similarity between the enhanced query and each segment, achieving more comprehensive grounding results.
  • Figure 4: Experimental results of different agent-based methods.
  • Figure S5: Analysis of the reasoning trajectories generated by our proposed Symphony and the single-agent DVD.
  • ...and 6 more figures
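
The reflection loop described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Critique` class, the `plan_and_execute` and `reflect` callables, the `max_rounds` cap, and the stub agents are all hypothetical names introduced here, and the real agents would call an MLLM rather than return fixed strings. The sketch only shows the control flow: produce a solution with its reasoning chain $\tau$, have a reflection step emit a critique $\mathcal{C}$, and feed that critique into the next round.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Critique:
    """Hypothetical container for the reflection agent's output C."""
    accepted: bool
    feedback: str = ""


# The reasoning chain tau is modelled here as a plain list of step descriptions.
ReasoningChain = List[str]


def reflective_reasoning(
    question: str,
    plan_and_execute: Callable[[str, Optional[Critique]], Tuple[str, ReasoningChain]],
    reflect: Callable[[str, ReasoningChain, str], Critique],
    max_rounds: int = 3,  # assumed cap; the caption only mentions "a subsequent round"
) -> Tuple[str, List[Critique]]:
    """Run plan/execute rounds until the reflection step accepts the answer."""
    critique: Optional[Critique] = None
    critiques: List[Critique] = []
    answer = ""
    for _ in range(max_rounds):
        # The planner sees the previous critique (None on the first round).
        answer, chain = plan_and_execute(question, critique)
        critique = reflect(question, chain, answer)
        critiques.append(critique)
        if critique.accepted:
            break
    return answer, critiques


# Stub agents standing in for the MLLM-backed planning and reflection agents.
def stub_planner(question: str, critique: Optional[Critique]) -> Tuple[str, ReasoningChain]:
    if critique is None:
        return "A", ["inspected the opening scene only"]
    return "B", ["inspected the opening scene", "followed critique: " + critique.feedback]


def stub_reflector(question: str, chain: ReasoningChain, answer: str) -> Critique:
    if len(chain) < 2:
        return Critique(accepted=False, feedback="ending not inspected")
    return Critique(accepted=True)


answer, critiques = reflective_reasoning("Who wins the match?", stub_planner, stub_reflector)
print(answer, len(critiques))  # prints "B 2"
```

Passing the critique back into the planner, rather than simply retrying, is what lets the second round explore a different reasoning path; the round cap is a safeguard assumed here so the loop always terminates.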