DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding
Yudong Han, Qingpei Guo, Liyuan Pan, Liu Liu, Yu Guan, Ming Yang
TL;DR
DynFocus tackles the memory bottleneck in LLM-based video understanding by exploiting two properties of long videos: frame redundancy and question-dependent frame relevance. It introduces Dynamic Event Prototype Estimation (DPE) to select meaningful frames and Compact Cooperative Encoding (CCE) to encode important frames with fine-grained features (Cones) while summarizing the others with coarse, text-guided tokens (Rods). A two-stage training regime first aligns video content with language and then fine-tunes an LLM on instruction-following data, achieving competitive or superior results on short- and long-video benchmarks while using far fewer tokens. The approach is also robust to video hallucination and shows clear efficiency advantages over state-of-the-art methods, making it attractive for scalable video-language applications. Overall, DynFocus provides a memory-efficient, dynamically adjustable framework that preserves the visual details and temporal cues essential for accurate video understanding in LLM-driven systems.
Abstract
The challenge in LLM-based video understanding lies in preserving visual and semantic information in long videos while keeping the token count within a memory-affordable budget. However, redundancy and correspondence in videos have limited the performance of existing methods. Through statistical learning on current datasets, we observe that redundancy occurs in both repeated and answer-irrelevant frames, and that the frames relevant to an answer vary with the question. This suggests that dynamic encoding can balance the preservation of detailed video information against token budget reduction. To this end, we propose DynFocus, a dynamic cooperative network for memory-efficient video encoding. Specifically, it consists of (i) a Dynamic Event Prototype Estimation (DPE) module that dynamically selects meaningful frames for question answering, and (ii) a Compact Cooperative Encoding (CCE) module that encodes the meaningful frames with detailed visual appearance and the remaining frames with sketchy perception. We evaluate our method on five publicly available benchmarks, and experimental results consistently demonstrate that our method achieves competitive performance.
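To make the token-budget trade-off concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes each frame already has a question-conditioned relevance score and uses a plain top-k selection as a stand-in for DPE, which the abstract describes only as dynamically selecting meaningful frames. The function name `allocate_tokens`, the scores, and the per-frame token counts (64 for detailed frames, 1 for the rest) are all hypothetical values chosen for illustration.

```python
from typing import List, Tuple


def allocate_tokens(
    relevance: List[float],
    num_detailed: int,
    fine_tokens: int = 64,   # assumed token count for a detailed frame
    coarse_tokens: int = 1,  # assumed token count for a sketchy frame
) -> Tuple[List[int], int]:
    """Return per-frame token counts and their total.

    The `num_detailed` highest-scoring frames receive `fine_tokens` each;
    every other frame receives `coarse_tokens`. Top-k selection is a
    simplification used here in place of the paper's DPE module.
    """
    if not 0 <= num_detailed <= len(relevance):
        raise ValueError("num_detailed must be between 0 and the frame count")
    # Rank frame indices by relevance, highest first.
    # sorted() is stable, so ties keep the earlier frame first.
    ranked = sorted(range(len(relevance)), key=lambda i: -relevance[i])
    detailed = set(ranked[:num_detailed])
    budget = [
        fine_tokens if i in detailed else coarse_tokens
        for i in range(len(relevance))
    ]
    return budget, sum(budget)


# Hypothetical relevance scores for six frames and one question.
scores = [0.1, 0.9, 0.2, 0.8, 0.1, 0.1]
budget, total = allocate_tokens(scores, num_detailed=2)
uniform_total = len(scores) * 64  # every frame encoded in full detail

print(budget)         # [1, 64, 1, 64, 1, 1]
print(total)          # 132
print(uniform_total)  # 384
```

Under these assumed numbers, encoding only the two most relevant frames in detail uses 132 tokens instead of 384, while every frame still contributes at least one token, so coarse temporal context is retained. A different question would produce different scores and therefore a different set of detailed frames.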
