MASR: Self-Reflective Reasoning through Multimodal Hierarchical Attention Focusing for Agent-based Video Understanding

Shiwen Cao, Zhaoxing Zhang, Junming Jiao, Juyi Qiao, Guowen Song, Rong Shen, Xiangbing Meng

TL;DR

MASR tackles the challenge of video understanding under large data volumes by introducing a multimodal, hierarchical attention framework with self-reflective reasoning. It combines clip-wise clustering, multimodal coarse-to-fine relevance sensing (MCRS), and dilated temporal expansion (DTE) to focus on query-relevant content, while a single LLM provides iterative reflection to refine context until a confident answer is produced. Empirical results across EgoSchema, NExT-QA, IntentQA, and Video-MME demonstrate state-of-the-art accuracy and robustness, with ablations confirming the critical roles of MCRS, DTE, and self-reflection. The approach offers a training-free, plug-and-play alternative to heavy fine-tuning, with practical impact on long-form video QA and real-time interpretation tasks, while highlighting avenues to reduce latency and optimize context retention in future work.

Abstract

Even in the era of rapid advances in large models, video understanding remains a highly challenging task. Compared to text or images, videos commonly contain more information along with more redundancy, requiring large models to allocate attention properly at a global level for comprehensive and accurate understanding. To address this, we propose a Multimodal hierarchical Attention focusing Self-reflective Reasoning (MASR) framework for agent-based video understanding. The key innovation lies in its ability to detect and prioritize the video segments that are highly relevant to the query. First, MASR performs Multimodal Coarse-to-fine Relevance Sensing (MCRS), which strengthens the correlation between the acquired contextual information and the query. Second, MASR employs Dilated Temporal Expansion (DTE) to mitigate the risk of missing crucial details when extracting semantic information from the focused frames selected through MCRS. By iteratively applying MCRS and DTE in the self-reflective reasoning process, MASR adaptively adjusts its attention to extract highly query-relevant context and thereby improves response accuracy. On the EgoSchema dataset, MASR achieves a 5% performance gain over previous leading approaches. On the NExT-QA and IntentQA datasets, it outperforms the state of the art by 0.2% and 0.3%, respectively. On the Video-MME dataset, which contains long videos, MASR also performs better than other agent-based methods.
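
The iterative use of MCRS and DTE described above can be pictured as a simple control loop. The following is a minimal sketch, not the authors' implementation: the function `reflective_answer`, its four callables (standing in for MCRS, DTE, the VLM captioner, and the LLM reflector), and the `threshold` and `max_rounds` values are all hypothetical names and settings chosen for illustration.

```python
from typing import Callable, List, Tuple


def reflective_answer(
    query: str,
    initial_frames: List[int],
    select_frames: Callable[[str, List[str]], List[int]],  # hypothetical stand-in for MCRS
    expand_frames: Callable[[List[int]], List[int]],       # hypothetical stand-in for DTE
    caption: Callable[[int], str],                         # hypothetical stand-in for the VLM
    answer_with_confidence: Callable[[str, List[str]], Tuple[str, float]],  # LLM reflector
    threshold: float = 0.8,  # assumed confidence cut-off, not taken from the paper
    max_rounds: int = 3,     # assumed iteration budget, not taken from the paper
) -> Tuple[str, int]:
    """Return (answer, number of reflection rounds used)."""
    context: List[str] = []
    seen = set()
    frames = initial_frames  # round 1 starts from the given frames, since context is empty
    answer = ""
    for round_idx in range(1, max_rounds + 1):
        # Expand the focused frames and describe any frame not yet in the context.
        for frame in expand_frames(frames):
            if frame not in seen:
                seen.add(frame)
                context.append(caption(frame))
        # Answer and self-assess; stop as soon as the reflector is confident.
        answer, confidence = answer_with_confidence(query, context)
        if confidence >= threshold:
            return answer, round_idx
        # Otherwise refocus using the context gathered so far.
        frames = select_frames(query, context)
    return answer, max_rounds
```

In this sketch the context only grows across rounds, so each reflection round sees everything gathered earlier; whether MASR prunes or replaces context between rounds is not stated in the abstract.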

Paper Structure

This paper contains 14 sections, 7 figures, 8 tables.

Figures (7)

  • Figure 1: A comparison of the two mainstream MLLM-based video understanding frameworks: Video-MLLM-based and Agent-based. The purple-highlighted sections in the agent-based method indicate the novel components of our MASR framework.
  • Figure 2: An illustration of the complete MASR pipeline. The leftmost section displays the input query and all video frames; the other sections show the core modules: Step 1 employs "selection" to denote the coarse attention focusing process that identifies highly relevant clip candidates based on semantic features; Step 2 utilizes "focusing" to represent the fine attention focusing process that pinpoints highly relevant frames through semantic-visual feature similarity matching; Step 3 performs DTE on the selected frames; Step 4 extracts semantic features from the expanded frames via a VLM as contextual information for question answering; Step 5 generates a response while evaluating its confidence score to determine whether to output it directly or to repeat the selection and focusing process to recover missing information. Notably, a single LLM in Step 5 serves as a reflector for response generation, confidence evaluation, and Step 1's coarse attention focusing. Since the context for the initial coarse focusing remains empty during the first self-reflection round, MASR directly inputs the cluster-center frames obtained in the initialization stage to Step 3 as the highly relevant frames. This is indicated by dashed lines in the diagram.
  • Figure 3: An example of the fine-grained relevance sensing process. Between the two coarsely selected video clip candidates [A, B] and [C, D], our fine-focusing algorithm determines that the latter segment is more relevant to the query, as it contains more relevant visual tokens. Consequently, we select frame d, which has the highest similarity score in the video clip [C, D].
  • Figure 4: An example of the DTE process in Step 3 with the parameters $w=7$, $r=2$, $wn=3$ and $s=3$, showing that a total of 9 frames are expanded through DTE for each fine-focused frame. These parameters can be adjusted adaptively.
  • Figure 5: A demonstration of MASR's reflective reasoning processes to answer questions in the EgoSchema dataset.
  • ...and 2 more figures
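
The fine-focusing step described in the Figure 3 caption can be approximated as follows. This is a rough sketch under stated assumptions rather than the paper's algorithm: it assumes each frame and the query are already embedded as vectors in a shared space, it counts a frame as "relevant" when its cosine similarity to the query reaches a hypothetical threshold `tau`, and the names `cosine` and `fine_focus` are illustrative only.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fine_focus(
    query_vec: Sequence[float],
    frame_vecs: Dict[int, Sequence[float]],  # frame index -> embedding (assumed given)
    clips: List[Tuple[int, int]],            # inclusive (start, end) clip candidates
    tau: float = 0.5,                        # assumed relevance threshold
) -> Tuple[Tuple[int, int], int]:
    """Pick the clip with the most relevant frames, then its best-matching frame."""
    best = None
    for start, end in clips:
        scores = {
            i: cosine(query_vec, frame_vecs[i])
            for i in range(start, end + 1)
            if i in frame_vecs
        }
        if not scores:
            continue
        n_relevant = sum(1 for s in scores.values() if s >= tau)
        top_frame = max(scores, key=scores.get)
        # Rank clips by relevant-frame count, breaking ties by the top score.
        key = (n_relevant, scores[top_frame])
        if best is None or key > best[0]:
            best = (key, (start, end), top_frame)
    if best is None:
        raise ValueError("no clip contains an embedded frame")
    return best[1], best[2]
```

Called with two candidate clips, the function returns the clip holding more above-threshold frames together with the index of its highest-scoring frame, mirroring the choice of frame d within [C, D] in the caption.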