MASR: Self-Reflective Reasoning through Multimodal Hierarchical Attention Focusing for Agent-based Video Understanding

Shiwen Cao; Zhaoxing Zhang; Junming Jiao; Juyi Qiao; Guowen Song; Rong Shen; Xiangbing Meng

MASR: Self-Reflective Reasoning through Multimodal Hierarchical Attention Focusing for Agent-based Video Understanding

Shiwen Cao, Zhaoxing Zhang, Junming Jiao, Juyi Qiao, Guowen Song, Rong Shen, Xiangbing Meng

TL;DR

MASR tackles the challenge of video understanding under large data volumes by introducing a multimodal, hierarchical attention framework with self-reflective reasoning. It combines clip-wise clustering, multimodal coarse-to-fine relevance sensing (MCRS), and dilated temporal expansion (DTE) to focus on query-relevant content, while a single LLM provides iterative reflection to refine context until a confident answer is produced. Empirical results across EgoSchema, NExT-QA, IntentQA, and Video-MME demonstrate state-of-the-art accuracy and robustness, with ablations confirming the critical roles of MCRS, DTE, and self-reflection. The approach offers a training-free, plug-and-play alternative to heavy fine-tuning, with practical impact on long-form video QA and real-time interpretation tasks, while highlighting avenues to reduce latency and optimize context retention in future work.

Abstract

Even in the era of rapid advances in large models, video understanding remains a highly challenging task. Compared to texts or images, videos commonly contain more information with redundancy, requiring large models to properly allocate attention at a global level for comprehensive and accurate understanding. To address this, we propose a Multimodal hierarchical Attention focusing Self-reflective Reasoning (MASR) framework for agent-based video understanding. The key innovation lies in its ability to detect and prioritize segments of videos that are highly relevant to the query. Firstly, MASR realizes Multimodal Coarse-to-fine Relevance Sensing (MCRS) which enhances the correlation between the acquired contextual information and the query. Secondly, MASR employs Dilated Temporal Expansion (DTE) to mitigate the risk of missing crucial details when extracting semantic information from the focused frames selected through MCRS. By iteratively applying MCRS and DTE in the self-reflective reasoning process, MASR is able to adaptively adjust the attention to extract highly query-relevant context and therefore improve the response accuracy. In the EgoSchema dataset, MASR achieves a remarkable 5% performance gain over previous leading approaches. In the Next-QA and IntentQA datasets, it outperforms the state-of-the-art standards by 0.2% and 0.3% respectively. In the Video-MME dataset that contains long-term videos, MASR also performs better than other agent-based methods.

MASR: Self-Reflective Reasoning through Multimodal Hierarchical Attention Focusing for Agent-based Video Understanding

TL;DR

Abstract

MASR: Self-Reflective Reasoning through Multimodal Hierarchical Attention Focusing for Agent-based Video Understanding

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (7)